CN115103446A - Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning - Google Patents


Info

Publication number: CN115103446A
Application number: CN202210579127.8A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: user, interference, base station, experience, communication
Inventors: 田峰, 马亮, 张嘉华, 吴晓富
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Priority/filing date: 2022-05-25
Publication date: 2022-09-23

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/04: Wireless resource allocation
    • H04W 72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453: Resources in frequency domain, e.g. a carrier in FDMA

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, which comprises the following steps: a multi-user wireless communication anti-interference system model is constructed; the base station first uses the sensed current spectrum information of the multiple users and the jammer as the input of a deep reinforcement learning policy neural network, then selects a joint action according to a dynamic greedy algorithm, and helps the users intelligently select communication frequency bands through base station feedback. Meanwhile, the immediate reward generated by the joint action of the current time slot is calculated, and the experience is stored in an experience replay pool. When the number of experiences in the replay pool reaches a given value, a certain number of experiences are randomly drawn from the pool to update the parameters of the policy neural network, and the parameters of the target neural network are updated once at fixed time intervals; this training process is repeated to complete the multi-user communication anti-interference intelligent decision method. The invention can realize anti-interference for multi-user communication and effectively avoid the communication interference caused by an external jammer and by internal users.

Description

Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning
Technical Field
The invention relates to a multi-user communication anti-interference intelligent decision method, in particular to a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning.
Background
Interference attacks are a serious problem in wireless communication networks. In general, a malicious interference signal interrupts the normal data reception of a legitimate transmission link and poses a serious threat to communication security. Furthermore, in a multi-user scenario with multiple transmission links, communication performance may degrade even more, since users are subject both to external malicious interference and to internal co-channel interference. Therefore, the cooperative anti-interference problem in communication still needs further research.
Spread spectrum technology is the mainstream anti-jamming technology in communication, with frequency hopping spread spectrum and direct sequence spread spectrum being widely used. These technologies have obvious anti-interference effects against conventional interference such as swept-frequency interference, pulse interference and broadband blocking interference. However, on the one hand, the conventional anti-interference methods have certain limitations: frequency hopping spread spectrum relies on a predetermined frequency hopping pattern, and direct sequence spread spectrum relies on a local pseudo-random code. On the other hand, with the development of artificial intelligence and software-defined radio technology, new trends towards diverse, dynamic and intelligent jammers place higher requirements on communication anti-interference technology.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, which can effectively cope with external malicious interference and avoid mutual interference among users.
The technical scheme is as follows: the invention discloses a multi-user communication anti-interference intelligent decision-making method which first constructs a wireless communication anti-interference system model for a plurality of users and then helps each user intelligently select the optimal communication frequency band through the feedback of a base station. The method comprises the following steps:
S1, constructing a multi-user wireless communication anti-interference system model consisting of a plurality of users, a base station and a jammer, wherein the users, the base station and the jammer are randomly distributed in an open area and share a spectrum space;
S2, the base station acquires the sensed current spectrum information of the multiple users and the jammer;
S3, establishing two convolutional neural network models, taking the current spectrum information as the input of the convolutional neural network models, then selecting a joint action according to a dynamic greedy algorithm, and helping the users intelligently select communication frequency bands through base station feedback;
S4, calculating the immediate reward generated by the joint action of the current time slot, and storing the experience in an experience replay pool, where the experience comprises the current spectrum selection state, the joint action, the immediate reward and the next spectrum selection information;
S5, when the number of experiences in the experience replay pool reaches a given value, randomly extracting a certain number of experiences from the pool to update the parameters of the policy neural network, and updating the parameters of the target neural network once at fixed time intervals; the iteration stops when the set number of iterations is reached.
Further, in step S2, the base station judges whether the communication of user u is successful according to the received SINR_u of user u; if the communication is successful, the normalized threshold g_u(f) is 1, otherwise it is 0.
The SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_u(f) and U_j(f) the power spectral densities of the signal of user u and of the jammer, f the signal frequency, f_k the center frequency of the channel k selected by user u, f_l the center frequency of the channel l occupied by the jammer, n(f) the power spectral density of the noise, and b the channel bandwidth; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise; \mathcal{N} represents the set of users.
Define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails. The normalized threshold g_u(f) is then:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}
Further, in step S3, one of the two convolutional neural networks is a policy neural network with weight parameter θ, and the other is a target neural network with weight parameter θ⁻; the weight parameters are initialized randomly. The two-dimensional spectrum waterfall O_t serves as the input of the neural network; the output after convolution is flattened into one-dimensional data by a flatten layer and then passes through four fully connected layers to obtain the final output value.
The joint action a(t) is selected by a dynamic ε-greedy algorithm as follows: at each iteration, an action a(t) is selected at random with probability ε, and the action a* = argmax_a Q_policy(O_t, a; θ_i) that maximizes the policy network Q_policy is selected with probability 1 − ε, where
\varepsilon = \varepsilon_0 \cdot e^{-\mathrm{decay}\cdot i}
ε_0 is the initial greedy probability, decay is the decay coefficient, i is the iteration number, ε decreases exponentially as the iteration number i increases, and e is the natural constant.
Further, the spectrum waterfall O_t is obtained as follows:
The discrete spectral sample values are defined as:
o'_{i,t} = \int_{(i-1)\Delta f}^{i\Delta f} S(f)\,df, \quad i = 1, 2, \ldots, L
wherein S(f) represents the power spectral density received by the base station, Δf is the resolution of the spectral analysis, and i is a positive integer indexing the samples.
The spectrum state observed by the base station at each time slot is:
o'_t = [o'_{1,t}, o'_{2,t}, …, o'_{L,t}]
The spectrum waterfall O_t is then defined as:
O_t = [o'_t, o'_{t+1}, …, o'_{t+W-1}]
wherein W represents the number of traced-back historical states; O_t is a two-dimensional matrix of size W × L containing both frequency-domain and time-domain information.
Further, in step S4, the immediate reward generated by the joint action of the current time slot is calculated as follows:
The action space is represented as:
A = {a_1, a_2, …, a_{n×m}}
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action taken by the users at time t; there are n×m joint actions a(t) in total.
The joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]
where f_n(t) denotes the center frequency of the channel selected by user n at time t.
The state transition probability is the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t, expressed as:
P: (O_t, a) → O_{t+1}
The immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]
where c is the frequency hopping cost.
Further, in step S5, after the number of elements in the experience replay pool D exceeds one batch, Num samples e_batch = {e_k}, e_k ~ U(D), k = 1, 2, …, Num are randomly drawn from the pool D, and the parameter θ_i of the policy network is iteratively updated by a gradient descent algorithm; the parameters θ⁻ of the target network are updated by periodically copying the parameters of the policy network.
After training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, and the base station feedback helps each user select the optimal anti-interference communication frequency band.
Compared with the prior art, the invention has the following remarkable effects:
1. In the multi-user communication anti-interference scenario, deep reinforcement learning is used instead of being limited to traditional reinforcement learning; meanwhile, a dynamic ε-greedy strategy is adopted, which improves the learning rate and accelerates the convergence of the algorithm;
2. The invention constructs a wireless communication anti-interference system model for multiple users and is not limited to randomly selecting communication frequency bands through frequency hopping; instead, according to the current spectrum state, the feedback of the base station helps each user intelligently select the optimal communication frequency band, i.e. the band least likely to be interfered, which can effectively cope with external malicious interference and avoid internal mutual interference among users.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a diagram of a multi-user wireless communication anti-interference system of the present invention;
FIG. 3 is a diagram of a neural network structure for deep reinforcement learning according to the present invention;
FIG. 4(a) is a spectrum diagram of a swept frequency interference mode in an embodiment of the present invention;
FIG. 4(b) is a spectrum diagram of a comb interference pattern in an embodiment of the present invention;
FIG. 4(c) is a spectrum diagram of a dynamic interference pattern in an embodiment of the present invention;
fig. 4(d) is a spectrum diagram of an intelligent interference pattern in an embodiment of the present invention;
FIG. 4(e) is a spectrum diagram of a random interference pattern in an embodiment of the present invention;
fig. 5(a) is a frequency spectrum diagram of an anti-interference model in a frequency sweep interference mode in the embodiment of the present invention;
FIG. 5(b) is a spectrum diagram of an interference rejection model in a comb interference mode according to an embodiment of the present invention;
FIG. 5(c) is a frequency spectrum diagram of an interference rejection model in a dynamic interference mode according to an embodiment of the present invention;
FIG. 5(d) is a spectrum diagram of an interference rejection model in the intelligent interference mode according to an embodiment of the present invention;
FIG. 5(e) is a spectrum diagram of an interference rejection model under a random interference mode in an embodiment of the present invention;
fig. 6 is a graph comparing normalized throughput of each user in dynamic interference mode according to the embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
The multi-user communication anti-interference intelligent decision method first constructs a multi-user wireless communication anti-interference system model. The base station uses the sensed current spectrum information of the multiple users and the jammer as the input of a deep reinforcement learning policy neural network (which realizes the selection of channel strategy actions), then selects a joint action (i.e., the multi-user communication channel selection) according to a dynamic greedy algorithm, and helps the users intelligently select communication frequency bands through base station feedback. Meanwhile, the immediate reward generated by the joint action of the current time slot is calculated, and the experience (the current spectrum selection state, the action, the immediate reward and the next spectrum selection information) is stored in an experience replay pool. When the number of experiences in the replay pool reaches a given value, a certain number of experiences are randomly drawn from the pool to update the parameters of the policy neural network, and the parameters of the target neural network are updated at fixed time intervals; this process is repeated so that finally each user can intelligently select an optimal communication frequency band, effectively cope with external malicious interference, and avoid internal mutual interference among users. The general flow chart is shown in fig. 1; the method specifically includes the following steps:
Step one: build the multi-user wireless communication anti-interference system model.
As shown in fig. 2, the multi-user wireless communication anti-interference system model of the present invention consists of several users, a base station and a jammer. The users, the base station and the jammer are randomly distributed in an open area and share a spectrum space. The users communicate with the base station respectively, and the user set is represented as:
\mathcal{N} = \{1, 2, \ldots, n\}    (1)
The spectrum space is divided into a number of channels, and the channel set is represented as:
\mathcal{M} = \{1, 2, \ldots, m\}    (2)
The number of available channels is m (n < m), and the bandwidth of each channel is b. The transmission power p_k of the k-th user is:
p_k = \int_{-b/2}^{b/2} U_k(f)\,df    (3)
wherein U_k(f) is the power spectral density of the signal transmitted by the k-th user, n is the number of users, and m is the number of channels.
The set of channels interfered by the jammer is:
J = \{1, \ldots, j\}    (4)
where j is the number of interfered channels.
When the channel being jammed is the same as the channel used for user communication, the jamming succeeds. If two or more users select the same frequency band, mutual interference results. In order to achieve reliable transmission, both external malicious interference and the mutual interference caused by contention between users need to be considered. The base station carries an agent with spectrum sensing, learning and decision-making capabilities; after each communication time slot, it defines the received spectrum information of the current time slot as the environment state, uses a deep reinforcement learning algorithm to make the anti-interference decision, and informs each user of the communication channel for the current time slot.
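For concreteness, a minimal sketch of how the system-model parameters of this step could be encoded in Python is given below; the class, field names and default values (taken from the embodiment in Example 1 where available) are illustrative assumptions rather than a normative part of the invention.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AntiJamSystemModel:
    """Parameters of the multi-user wireless anti-interference system model."""
    n_users: int = 2                 # number of users n (n < m)
    n_channels: int = 5              # number of available channels m
    channel_bw_hz: float = 4e6       # channel bandwidth b
    band_hz: float = 20e6            # total contested band B
    user_power_dbm: float = 0.0      # per-user transmit power
    jammer_power_dbm: float = 30.0   # jammer transmit power
    sinr_threshold_db: float = 10.0  # demodulation threshold beta_th
    hop_cost: float = 0.2            # frequency hopping cost c
    jammed_channels: List[int] = field(default_factory=list)  # set J

    def channel_center(self, k: int) -> float:
        """Center frequency f_k of channel k (k = 0..m-1), assuming the
        channels tile the band evenly from its lower edge."""
        return (k + 0.5) * self.band_hz / self.n_channels
```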
Step two: the SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}    (5)
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_j(f) the power spectral density of the jammer, f_k the center frequency of the channel k selected by user u, n(f) the power spectral density of the noise, f the signal frequency and f_l the center frequency of the channel l occupied by the jammer; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise.
Define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails. The normalized threshold g_u(f) is:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}    (6)
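As a hedged numerical illustration of step two, the sketch below evaluates SINR_u and the success indicator g_u under a simplifying assumption of flat (rectangular) power spectral densities over one channel band; the patent itself integrates raised-cosine PSDs over the band, so the function and all of its names are an approximation introduced only for illustration.

```python
import numpy as np

def sinr_and_success(u, freqs, gains, powers, jam_freq, jam_gain, jam_power,
                     noise_power, bandwidth, beta_th_db=10.0):
    """Approximate SINR_u for user u assuming flat PSDs over one channel band.

    freqs   : center frequencies f_v chosen by all users
    gains   : channel gains G_v from each user to the base station
    powers  : in-band received signal powers of each user (linear scale)
    jam_*   : jammer center frequency, channel gain and in-band power
    Returns (SINR_u in dB, g_u) with g_u = 1 on success, 0 on failure.
    """
    freqs = np.asarray(freqs, dtype=float)
    gains = np.asarray(gains, dtype=float)
    powers = np.asarray(powers, dtype=float)

    f_k = freqs[u]
    signal = gains[u] * powers[u]

    # co-channel interference: other users that picked the same channel
    same = (np.arange(len(freqs)) != u) & (freqs == f_k)
    co_channel = np.sum(gains[same] * powers[same])

    # jammer contributes only the fraction of its band overlapping channel k
    overlap = max(0.0, bandwidth - abs(jam_freq - f_k)) / bandwidth
    jamming = jam_gain * jam_power * overlap

    sinr_lin = signal / (co_channel + jamming + noise_power)
    sinr_db = 10.0 * np.log10(sinr_lin)
    g_u = 1 if sinr_db > beta_th_db else 0
    return sinr_db, g_u
```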
step three: defining discrete spectral sample values as
Figure BDA0003661635280000057
Wherein S (f) is ∑ k∈N G k U k (f-f k )+G j U j (f-f k ) + n (f) represents the user power spectral density received by the base station, Δ f is the resolution of the spectral analysis, and i is a positive integer and is used to represent the number of samples.
The result of the spectrum state observed by the base station each time is as follows:
o’ t =[o’ 1,t ,o’ 2,t …o’ L,t ] (8)
the environmental state is defined as:
O t =[o’ t ,o’ t+1 …o’ t+W-1 ] (9)
wherein W represents the history state number of backtracking, O t Is a two-dimensional matrix of size W × L, and O t The thermodynamic diagram of (1) is called a spectral waterfall and contains information in frequency domain and time domain.
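The W × L environment state of equations (7)-(9) can be maintained incrementally; the sketch below, with illustrative names and a simple buffer, shows one way to turn per-slot spectral observations o'_t into the waterfall matrix O_t.

```python
from collections import deque
import numpy as np

class SpectrumWaterfall:
    """Keeps the last W spectral observations and exposes them as O_t (W x L)."""

    def __init__(self, history_len: int, n_bins: int):
        self.W = history_len                     # number of traced-back slots
        self.L = n_bins                          # frequency bins per slot
        self.buf = deque(maxlen=history_len)

    def push(self, spectrum_slice: np.ndarray) -> None:
        """spectrum_slice is o'_t = [o'_{1,t}, ..., o'_{L,t}] for one slot."""
        assert spectrum_slice.shape == (self.L,)
        self.buf.append(spectrum_slice.astype(np.float32))

    def state(self) -> np.ndarray:
        """Return O_t; zero-padded until W observations have been collected."""
        rows = list(self.buf)
        while len(rows) < self.W:
            rows.insert(0, np.zeros(self.L, dtype=np.float32))
        return np.stack(rows, axis=0)            # shape (W, L)
```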
Step four: solve the anti-interference problem with a deep reinforcement learning algorithm, and take the spectrum waterfall O_t as the environment state at the current time t.
The action space is:
A = {a_1, a_2, …, a_{n×m}}    (10)
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action that the users can take at time t; there are n×m joint actions a(t) in total.
The joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]    (11)
where f_n(t) denotes the center frequency of the channel selected by user n at time t.
The state transition probability is expressed as:
P: (O_t, a) → O_{t+1}    (12)
which denotes the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t.
The immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]    (13)
where c is the frequency hopping cost.
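A small sketch of the per-slot joint reward follows, under the assumption (made when reconstructing equation (13), whose original image is not available) that the reward sums each user's success indicator g_u minus a hopping penalty c whenever that user changed channel; the function name and signature are illustrative.

```python
def joint_reward(success_flags, current_freqs, previous_freqs, hop_cost=0.2):
    """Immediate reward r(t) for the joint action of one time slot.

    success_flags : list of g_u values (1 = successful transmission, 0 = failed)
    current_freqs : channel center frequencies chosen at time t
    previous_freqs: channel center frequencies chosen at time t-1
    hop_cost      : frequency hopping cost c
    """
    reward = 0.0
    for g_u, f_now, f_prev in zip(success_flags, current_freqs, previous_freqs):
        hopped = f_now != f_prev
        reward += g_u - (hop_cost if hopped else 0.0)
    return reward
```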
Step five: because the environment state O_t evolves according to the unknown transition probability P(O_{t+1} | O_t, a(t)), which changes dynamically, and the state-action space of the anti-interference decision process is very large, a deep convolutional neural network (CNN) is used to approximate the Q function of each state-action pair (the expected discounted long-term reward of state O_t and action a(t)), namely:
Q(O_t, a(t)) = \mathbb{E}\left[r(t) + \gamma \max_{a'} Q(O_{t+1}, a')\right]    (14)
where r(t) is the immediate reward at the current time t, a' is the joint action taken by the user set when the Q value is maximal, O_{t+1} is the next state after the user set takes the joint action a(t) in state O_t, and γ is the discount factor.
As shown in FIG. 3, two convolutional neural networks are established: one is the policy neural network with weight parameter θ, and the other is the target neural network with weight parameter θ⁻; the weight parameters are initialized randomly. The two-dimensional spectrum waterfall O_t serves as the input of the neural network; the output after convolution is flattened into one-dimensional data by a flatten layer and then passes through four fully connected layers to obtain the final output value, so that the Q function can be represented by the nonlinear function of the neural network.
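A hedged PyTorch sketch of a network matching the description of FIG. 3 (convolutional feature extraction on the W × L waterfall, a flatten layer, then four fully connected layers outputting one Q value per joint action) is given below; the kernel sizes, channel counts, layer widths, activations and the default number of actions are not specified in the text and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Policy/target Q network: spectrum waterfall O_t -> Q values per joint action."""

    def __init__(self, history_len=200, n_bins=200, n_actions=10):
        super().__init__()
        self.conv = nn.Sequential(                       # convolutional feature extraction
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        with torch.no_grad():                            # infer flattened feature size
            dummy = torch.zeros(1, 1, history_len, n_bins)
            flat = self.conv(dummy).flatten(1).shape[1]
        self.fc = nn.Sequential(                         # four fully connected layers
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                   # one Q value per joint action
        )

    def forward(self, x):                                # x: (batch, 1, W, L)
        return self.fc(self.conv(x).flatten(1))
```

One instance of such a class would serve as the policy network (θ) and a second, periodically synchronized copy as the target network (θ⁻).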
The experience e_t = (O_t, a(t), r(t), O_{t+1}) of each time step t is stored in the data set D_t, and elements are randomly drawn from the uniform distribution e ~ U(D) to obtain the learning target value η_i:
\eta_i = r(t) + \gamma\, Q(O_{t+1}, a'; \theta_i^-)    (15)
e_t = (O_t, a(t), r(t), O_{t+1})    (16)
D_t = (e_1, \ldots, e_t)    (17)
where θ_i⁻ is the parameter of the target Q network at the i-th iteration, and a' is the action that maximizes the policy network Q_policy. When the input is O_{t+1}, the output of the target Q network gives η_i. Let the parameter of the policy Q network at the i-th iteration be θ_i; the mean square error between the target value and the actual output of the policy Q network is taken as the loss function:
L(\theta_i) = \mathbb{E}_{e\sim U(D)}\left[\big(\eta_i - Q(O_t, a; \theta_i)\big)^2\right]    (18)
The gradient of the loss function is:
\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{e\sim U(D)}\left[\big(\eta_i - Q(O_t, a; \theta_i)\big)\,\nabla_{\theta_i} Q(O_t, a; \theta_i)\right]    (19)
step six: in the training phase, according to the state O t Selecting the joint action a (t) by adopting a dynamic epsilon-greedy algorithm, namely randomly selecting the probability of the action a (t) to be epsilon at each iteration, and selecting the order strategy network Q policy Maximum action a ═ argmax a Q policy (O t ,a;θ i ) Has a probability of 1-epsilon, wherein
Figure BDA0003661635280000075
As an initial greedy probability, decay is the decay coefficient, i is the number of iterations, ε decreases exponentially with increasing number of iterations i, to preserveTo prove the exploratory nature of the algorithm, ε does not decay to 0. And sample e t =(O t ,a(t),r(t),O t+1 ) And storing the data into an experience playback pool D, and updating the experience playback pool with new samples according to a first-in first-out principle after the experience playback pool D is full.
After the number of elements in the empirical playback pool D is greater than one batch, Num samples e are randomly selected from the empirical playback pool D batch ={e k ,e k U (d), k 1,2, Num, the parameter θ of the policy network is performed by a gradient descent algorithm i Iteratively updating, for parameters of the target network
Figure BDA0003661635280000076
The parameters of the policy network are copied to update the parameters periodically (C times per iteration), and the training process is repeated until the maximum number of iterations is reached.
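To accompany step six, the following sketch shows a dynamic ε-greedy selector and a first-in first-out replay pool consistent with the description (ε = ε_0·e^(−decay·i), uniform sampling e ~ U(D)); the class names, default hyperparameters and the small floor on ε are illustrative assumptions.

```python
import math
import random
from collections import deque

class DynamicEpsilonGreedy:
    """epsilon = epsilon_0 * exp(-decay * i); eps_min keeps epsilon > 0 (assumed value)."""

    def __init__(self, eps0=1.0, decay=1e-3, eps_min=0.01):
        self.eps0, self.decay, self.eps_min = eps0, decay, eps_min

    def epsilon(self, iteration: int) -> float:
        return max(self.eps_min, self.eps0 * math.exp(-self.decay * iteration))

    def select(self, q_values, iteration: int) -> int:
        """q_values: 1-D sequence of Q(O_t, a; theta) over all joint actions."""
        if random.random() < self.epsilon(iteration):
            return random.randrange(len(q_values))                       # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

class ReplayPool:
    """FIFO experience replay pool D storing e_t = (O_t, a(t), r(t), O_{t+1})."""

    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)       # oldest samples dropped first

    def add(self, experience):
        self.pool.append(experience)

    def sample(self, num):
        return random.sample(list(self.pool), num)   # e ~ U(D), without replacement

    def __len__(self):
        return len(self.pool)
```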
After training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, the base station feedback helps each user select the optimal anti-interference communication frequency band, and the network parameters no longer need to be updated iteratively.
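As a concrete illustration of this inference step, the following is a minimal sketch (not the patent's actual implementation) of mapping the current spectrum waterfall O_t to a joint channel selection by taking the argmax over the trained policy-network outputs; the parameter names and the channel-center lookup are assumptions.

```python
import torch

def select_joint_action(policy_net, spectrum_waterfall, channel_centers):
    """Greedy inference: pick the joint action with the largest Q value.

    policy_net        : trained policy network mapping a (1, 1, W, L) tensor
                        to a vector of Q values, one per joint action
    spectrum_waterfall: 2-D numpy array O_t of shape (W, L)
    channel_centers   : list mapping each joint-action index to the tuple of
                        per-user channel center frequencies f_1(t)..f_n(t)
    """
    obs = torch.as_tensor(spectrum_waterfall, dtype=torch.float32)
    obs = obs.unsqueeze(0).unsqueeze(0)          # add batch and channel dimensions
    with torch.no_grad():
        q_values = policy_net(obs)               # shape (1, n_actions)
    best = int(q_values.argmax(dim=1).item())    # index of the maximum Q value
    return best, channel_centers[best]           # joint action a(t)
```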
The specific algorithm implementation process is presented as a pseudocode table (figure) in the original filing.
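Since that table is only available as an image, the following is a hedged Python/PyTorch sketch of a training loop matching steps one to six (dual networks, dynamic ε-greedy selection, replay sampling, gradient descent on the loss of equation (18), and target-network copies every C iterations). The environment object env and its reset/step interface, the hyperparameter values, and the helper classes QNetwork, DynamicEpsilonGreedy and ReplayPool from the earlier sketches are all illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def train_anti_jamming_dqn(env, n_actions, iterations=5000, batch_size=32,
                           gamma=0.9, lr=1e-4, target_copy_every=100):
    policy_net = QNetwork(n_actions=n_actions)           # weights theta
    target_net = QNetwork(n_actions=n_actions)           # weights theta^-
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    explorer, pool = DynamicEpsilonGreedy(), ReplayPool()

    obs = env.reset()                                     # spectrum waterfall O_t
    for i in range(iterations):
        # 1. dynamic epsilon-greedy joint action selection
        with torch.no_grad():
            q = policy_net(torch.as_tensor(obs, dtype=torch.float32)[None, None])
        action = explorer.select(q.squeeze(0).tolist(), i)

        # 2. apply the joint action, observe reward and next spectrum waterfall
        next_obs, reward = env.step(action)
        pool.add((obs, action, reward, next_obs))
        obs = next_obs

        # 3. once enough experience is stored, update the policy network
        if len(pool) >= batch_size:
            batch = pool.sample(batch_size)
            o, a, r, o2 = zip(*batch)
            o = torch.as_tensor(np.array(o), dtype=torch.float32)[:, None]
            o2 = torch.as_tensor(np.array(o2), dtype=torch.float32)[:, None]
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)

            with torch.no_grad():                         # target eta_i, eq. (15)
                best_a = policy_net(o2).argmax(dim=1, keepdim=True)   # a' from Q_policy
                target = r + gamma * target_net(o2).gather(1, best_a).squeeze(1)
            q_sa = policy_net(o).gather(1, a[:, None]).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, target)   # loss of eq. (18)

            optimizer.zero_grad()
            loss.backward()                               # gradient step, eq. (19)
            optimizer.step()

        # 4. periodically copy the policy weights into the target network
        if (i + 1) % target_copy_every == 0:
            target_net.load_state_dict(policy_net.state_dict())

    return policy_net
```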
example 1
Embodiments of the invention are described in detail as follows: the system simulation uses the Python PyTorch framework; the system model contains two users, a base station and a jammer, and the two users and the jammer contend with each other within a frequency band of B = 20 MHz. The base station performs full-band sensing every 1 ms with Δf = 100 kHz and retains the spectral data of the last 200 ms, so the size of the matrix O_t is 200 × 200. The bandwidth of the user signal is 4 MHz, the center frequency is changed every 10 ms in 4 MHz steps, and the bandwidth of the interference signal is also 4 MHz. Both the user signal and the interference signal are raised-cosine waveforms with roll-off coefficient σ = 0.5; the signal power of each of user 1 and user 2 is 0 dBm, and the signal power of the jammer is 30 dBm. The demodulation threshold β_th at all frequencies is set to 10 dB, the channel gains are set to G_n = G_j, and the hopping cost of each user is set to c = 0.2.
In this embodiment, 5 interference modes are selected, specifically as follows:
(1) Sweep interference:
the center frequency of the interference signal is determined by the sweep rate v and the time t:
f_j(t) = (v \cdot t) \,\%\, B
where "%" is the remainder (modulo) operator; the sweep rate v is 0.6 GHz/s.
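A one-line sketch of this swept-frequency jammer follows, under the assumption (used when reconstructing the formula above) that the modulo wraps the linearly increasing frequency back into the 20 MHz band; the function name and defaults are illustrative.

```python
def sweep_center_freq(t_s: float, sweep_rate_hz_per_s: float = 0.6e9,
                      band_hz: float = 20e6) -> float:
    """Center frequency of the swept jammer at time t_s (in seconds)."""
    return (sweep_rate_hz_per_s * t_s) % band_hz
```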
(2) Comb interference: the center frequencies of the interference signals are fixed at 6MHz and 14 MHz.
(3) Dynamic interference: the comb or swept interference pattern is randomly selected and remains constant for a certain time (100 ms).
(4) Intelligent interference: by counting the probability of each user action over the past 100 ms, the two center frequencies with the highest action probability are selected as the comb interference frequencies.
(5) Random interference: two of 2MHz, 6MHz, 10MHz, 14MHz and 18MHz are randomly selected as the center frequencies of the comb interference and are kept unchanged for a certain time (50 ms).
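For illustration only, a compact sketch of how the interference modes listed above could be driven in a simulation; the mode names, timings and candidate tones follow the embodiment, while the class itself, its interface and its reliance on the sweep_center_freq helper from the earlier sketch are assumptions (the "intelligent" mode is omitted because it additionally needs the users' recent action statistics).

```python
import random

class Jammer:
    """Sketch of the jamming modes; returns center frequencies (Hz) per time step."""

    COMB = (6e6, 14e6)                            # fixed comb tones of mode (2)
    RANDOM_POOL = (2e6, 6e6, 10e6, 14e6, 18e6)    # candidate tones of mode (5)

    def __init__(self, mode: str):
        self.mode = mode
        self._dyn_period = -1                     # last 100 ms period seen by "dynamic"
        self._dyn_is_comb = True
        self._rand_period = -1                    # last 50 ms period seen by "random"
        self._rand_tones = random.sample(self.RANDOM_POOL, 2)

    def centers(self, t_ms: float):
        if self.mode == "sweep":
            return (sweep_center_freq(t_ms * 1e-3),)
        if self.mode == "comb":
            return self.COMB
        if self.mode == "dynamic":                # randomly re-pick comb or sweep every 100 ms
            p = int(t_ms // 100)
            if p != self._dyn_period:
                self._dyn_period, self._dyn_is_comb = p, random.random() < 0.5
            return self.COMB if self._dyn_is_comb else (sweep_center_freq(t_ms * 1e-3),)
        if self.mode == "random":                 # two new comb tones every 50 ms
            p = int(t_ms // 50)
            if p != self._rand_period:
                self._rand_period = p
                self._rand_tones = random.sample(self.RANDOM_POOL, 2)
            return tuple(self._rand_tones)
        raise ValueError("the 'intelligent' mode needs the users' recent action statistics")
```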
FIGS. 4(a), 4(b), 4(c), 4(d) and 4(e) are spectrum waterfall diagrams of the 5 interference patterns in the embodiment of the present invention, in which the abscissa represents frequency (in units of 10^5 Hz), the ordinate represents time (in ms), and the shades of color represent power (in dBm). In the spectrum waterfall, the light squares represent the communication signals transmitted by the users, the dark diagonal lines and squares represent the interference signals emitted by the jammer, and the black background represents the noise.
Fig. 5(a), 5(b), 5(c), 5(d), and 5(e) are spectrograms of anti-interference models in 5 interference modes in the embodiment of the present invention, and it can be seen that after iterative training, the agent can learn an interference strategy of the jammer, help the user avoid an interference signal of the jammer, and simultaneously consider mutual interference caused by contention among users, so as to achieve effective anti-interference.
Fig. 6 is a comparison graph of normalized throughput of each user in the dynamic interference mode in the embodiment of the present invention, and it can be seen from the graph that as the number of iterations increases, the normalized throughput (probability of successful communication in unit time) of each user gradually increases and then tends to converge, compared with a deep reinforcement learning algorithm with random frequency hopping and a fixed epsilon value, the normalized throughput of each user after convergence of the present invention is superior to that of the other two algorithms, and after 4000 iterations, the normalized throughput can reach 0.94 or more, which proves that the algorithm provided by the present invention has a better anti-interference effect.
In summary, the multi-user communication anti-interference intelligent decision method provided by the invention adopts a dynamic ε-greedy strategy, which improves the learning rate, and can effectively cope with external malicious interference and the mutual interference caused by competition among users. It is not limited to randomly selecting a communication frequency band through frequency hopping; instead, it helps each user automatically select the optimal communication frequency band, i.e. the band least likely to be interfered, according to the current spectrum state.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (6)

1. A multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, characterized by comprising the following steps:
S1, constructing a multi-user wireless communication anti-interference system model composed of a plurality of users, a base station and a jammer, wherein the users, the base station and the jammer are randomly distributed in an open area and share a spectrum space;
S2, the base station acquires the sensed current spectrum information of the multiple users and the jammer;
S3, establishing two convolutional neural network models, taking the current spectrum information as the input of the convolutional neural network models, then selecting a joint action according to a dynamic greedy algorithm, and helping the users intelligently select communication frequency bands through base station feedback;
S4, calculating the immediate reward generated by the joint action of the current time slot, and storing the experience in an experience replay pool, where the experience comprises the current spectrum selection state, the joint action, the immediate reward and the next spectrum selection information;
S5, when the number of experiences in the experience replay pool reaches a given value, randomly extracting a certain number of experiences from the pool to update the parameters of the policy neural network, and updating the parameters of the target neural network once at fixed time intervals; the iteration stops when the set number of iterations is reached.
2. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 1, wherein in step S2 the base station judges whether the communication of user u is successful according to the received SINR_u of user u; if the communication is successful, the normalized threshold g_u(f) is 1, otherwise it is 0;
the SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_j(f) the power spectral density of the jammer, f the signal frequency, f_k the center frequency of the channel k selected by user u, f_l the center frequency of the channel l occupied by the jammer, and n(f) the power spectral density of the noise; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise; \mathcal{N} represents the set of users;
define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails; the normalized threshold g_u(f) is then:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}
3. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 1, wherein in step S3 one of the two convolutional neural networks is a policy neural network with weight parameter θ and the other is a target neural network with weight parameter θ⁻, and the weight parameters are initialized randomly; the two-dimensional spectrum waterfall O_t serves as the input of the neural network, the output after convolution is flattened into one-dimensional data by a flatten layer, and the final output value is obtained after four fully connected layers;
the joint action a(t) is selected by a dynamic ε-greedy algorithm as follows:
at each iteration, an action a(t) is selected at random with probability ε, and the action a* = argmax_a Q_policy(O_t, a; θ_i) that maximizes the policy network Q_policy is selected with probability 1 − ε, where
\varepsilon = \varepsilon_0 \cdot e^{-\mathrm{decay}\cdot i}
ε_0 is the initial greedy probability, decay is the decay coefficient, i is the iteration number, ε decreases exponentially as the iteration number i increases, and e is the natural constant.
4. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein the spectrum waterfall O_t is obtained as follows:
the discrete spectral sample values are defined as:
o'_{i,t} = \int_{(i-1)\Delta f}^{i\Delta f} S(f)\,df, \quad i = 1, 2, \ldots, L
wherein S(f) represents the power spectral density received by the base station, Δf is the resolution of the spectral analysis, and i is a positive integer indexing the samples;
the spectrum state observed by the base station at each time slot is:
o'_t = [o'_{1,t}, o'_{2,t}, …, o'_{L,t}]
the spectrum waterfall O_t is then defined as:
O_t = [o'_t, o'_{t+1}, …, o'_{t+W-1}]
wherein W represents the number of traced-back historical states; O_t is a two-dimensional matrix of size W × L containing both frequency-domain and time-domain information.
5. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein in step S4 the immediate reward generated by the joint action of the current time slot is calculated as follows:
the action space is represented as:
A = {a_1, a_2, …, a_{n×m}}
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action taken by the users at time t; there are n×m joint actions a(t) in total;
the joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]
where f_n(t) denotes the center frequency of the channel selected by user n at time t;
the state transition probability is the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t, expressed as:
P: (O_t, a) → O_{t+1}
the immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]
where c is the frequency hopping cost.
6. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein in step S5, after the number of elements in the experience replay pool D exceeds one batch, Num samples e_batch = {e_k}, e_k ~ U(D), k = 1, 2, …, Num are randomly drawn from the pool D, and the parameter θ_i of the policy network is iteratively updated by a gradient descent algorithm; the parameters θ⁻ of the target network are updated by periodically copying the parameters of the policy network;
after training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, and the base station feedback helps each user select the optimal anti-interference communication frequency band.