CN115103446A - Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning - Google Patents


Info

Publication number: CN115103446A
Application number: CN202210579127.8A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: user, interference, base station, experience, communication
Inventors: 田峰, 马亮, 张嘉华, 吴晓富
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Priority/filing date: 2022-05-25
Publication date: 2022-09-23

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/04: Wireless resource allocation
    • H04W 72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453: Resources in frequency domain, e.g. a carrier in FDMA

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, which comprises the following steps: a multi-user wireless communication anti-interference system model is constructed; the base station first uses the sensed current spectrum information of the multiple users and the jammer as the input of a deep reinforcement learning policy neural network, then selects a joint action according to a dynamic greedy algorithm, and helps the users intelligently select communication frequency bands through base station feedback. Meanwhile, the immediate reward generated by the joint action of the current time slot is calculated, and the experience is stored in an experience replay pool. When the number of experiences in the replay pool reaches a given value, a certain number of experiences are randomly drawn from the pool to update the parameters of the policy neural network, and the parameters of the target neural network are updated once at fixed time intervals; this training process is repeated to complete the multi-user communication anti-interference intelligent decision method. The invention can realize anti-interference for multi-user communication and effectively avoid the communication interference caused by an external jammer and by internal users.

Description

Multi-user communication anti-interference intelligent decision-making method based on deep reinforcement learning
Technical Field
The invention relates to a multi-user communication anti-interference intelligent decision method, in particular to a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning.
Background
Interference attacks are a serious problem in wireless communication networks. In general, a malicious interference signal interrupts the normal data reception of a legitimate transmission link and poses a serious threat to communication security. Furthermore, in a multi-user scenario with multiple transmission links, communication performance may degrade even more, since users are subject both to external malicious interference and to internal co-channel interference. Therefore, the cooperative anti-interference problem in communication still needs further research.
Spread spectrum technology is the mainstream anti-jamming technology in communication, with frequency hopping spread spectrum and direct sequence spread spectrum being widely used. These technologies have obvious anti-interference effects against conventional interference such as swept-frequency interference, pulse interference and broadband blocking interference. However, on the one hand, the conventional anti-interference methods have certain limitations: frequency hopping spread spectrum relies on a predetermined frequency hopping pattern, and direct sequence spread spectrum relies on a local pseudo-random code. On the other hand, with the development of artificial intelligence and software-defined radio technology, new trends towards diverse, dynamic and intelligent jammers place higher requirements on communication anti-interference technology.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, which can effectively cope with external malicious interference and avoid mutual interference among users.
The technical scheme is as follows: the invention discloses a multi-user communication anti-interference intelligent decision-making method which first constructs a wireless communication anti-interference system model for a plurality of users and then helps each user intelligently select the optimal communication frequency band through the feedback of a base station. The method comprises the following steps:
S1, constructing a multi-user wireless communication anti-interference system model consisting of a plurality of users, a base station and a jammer, wherein the users, the base station and the jammer are randomly distributed in an open area and share a spectrum space;
S2, the base station acquires the sensed current spectrum information of the multiple users and the jammer;
S3, establishing two convolutional neural network models, taking the current spectrum information as the input of the convolutional neural network models, then selecting a joint action according to a dynamic greedy algorithm, and helping the users intelligently select communication frequency bands through base station feedback;
S4, calculating the immediate reward generated by the joint action of the current time slot, and storing the experience in an experience replay pool, where the experience comprises the current spectrum selection state, the joint action, the immediate reward and the next spectrum selection information;
S5, when the number of experiences in the experience replay pool reaches a given value, randomly extracting a certain number of experiences from the pool to update the parameters of the policy neural network, and updating the parameters of the target neural network once at fixed time intervals; the iteration stops when the set number of iterations is reached.
Further, in step S2, the base station judges whether the communication of user u is successful according to the received SINR_u of user u; if the communication is successful, the normalized threshold g_u(f) is 1, otherwise it is 0.
The SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_u(f) and U_j(f) the power spectral densities of the signal of user u and of the jammer, f the signal frequency, f_k the center frequency of the channel k selected by user u, f_l the center frequency of the channel l occupied by the jammer, n(f) the power spectral density of the noise, and b the channel bandwidth; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise; \mathcal{N} represents the set of users.
Define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails. The normalized threshold g_u(f) is then:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}
Further, in step S3, one of the two convolutional neural networks is a policy neural network with weight parameter θ, and the other is a target neural network with weight parameter θ⁻; the weight parameters are initialized randomly. The two-dimensional spectrum waterfall O_t serves as the input of the neural network; the output after convolution is flattened into one-dimensional data by a flatten layer and then passes through four fully connected layers to obtain the final output value.
The joint action a(t) is selected by a dynamic ε-greedy algorithm as follows: at each iteration, an action a(t) is selected at random with probability ε, and the action a* = argmax_a Q_policy(O_t, a; θ_i) that maximizes the policy network Q_policy is selected with probability 1 − ε, where
\varepsilon = \varepsilon_0 \cdot e^{-\mathrm{decay}\cdot i}
ε_0 is the initial greedy probability, decay is the decay coefficient, i is the iteration number, ε decreases exponentially as the iteration number i increases, and e is the natural constant.
Further, the spectrum waterfall O_t is obtained as follows:
The discrete spectral sample values are defined as:
o'_{i,t} = \int_{(i-1)\Delta f}^{i\Delta f} S(f)\,df, \quad i = 1, 2, \ldots, L
wherein S(f) represents the power spectral density received by the base station, Δf is the resolution of the spectral analysis, and i is a positive integer indexing the samples.
The spectrum state observed by the base station at each time slot is:
o'_t = [o'_{1,t}, o'_{2,t}, …, o'_{L,t}]
The spectrum waterfall O_t is then defined as:
O_t = [o'_t, o'_{t+1}, …, o'_{t+W-1}]
wherein W represents the number of traced-back historical states; O_t is a two-dimensional matrix of size W × L containing both frequency-domain and time-domain information.
Further, in step S4, the immediate reward generated by the joint action of the current time slot is calculated as follows:
The action space is represented as:
A = {a_1, a_2, …, a_{n×m}}
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action taken by the users at time t; there are n×m joint actions a(t) in total.
The joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]
where f_n(t) denotes the center frequency of the channel selected by user n at time t.
The state transition probability is the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t, expressed as:
P: (O_t, a) → O_{t+1}
The immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]
where c is the frequency hopping cost.
Further, in step S5, after the number of elements in the experience replay pool D exceeds one batch, Num samples e_batch = {e_k}, e_k ~ U(D), k = 1, 2, …, Num are randomly drawn from the pool D, and the parameter θ_i of the policy network is iteratively updated by a gradient descent algorithm; the parameters θ⁻ of the target network are updated by periodically copying the parameters of the policy network.
After training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, and the base station feedback helps each user select the optimal anti-interference communication frequency band.
Compared with the prior art, the invention has the following remarkable effects:
1. In the multi-user communication anti-interference scenario, deep reinforcement learning is used instead of being limited to traditional reinforcement learning; meanwhile, a dynamic ε-greedy strategy is adopted, which improves the learning rate and accelerates the convergence of the algorithm;
2. The invention constructs a wireless communication anti-interference system model for multiple users and is not limited to randomly selecting communication frequency bands through frequency hopping; instead, according to the current spectrum state, the feedback of the base station helps each user intelligently select the optimal communication frequency band, i.e. the band least likely to be interfered, which can effectively cope with external malicious interference and avoid internal mutual interference among users.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a diagram of a multi-user wireless communication anti-interference system of the present invention;
FIG. 3 is a diagram of a neural network structure for deep reinforcement learning according to the present invention;
FIG. 4(a) is a spectrum diagram of a swept frequency interference mode in an embodiment of the present invention;
FIG. 4(b) is a spectrum diagram of a comb interference pattern in an embodiment of the present invention;
FIG. 4(c) is a spectrum diagram of a dynamic interference pattern in an embodiment of the present invention;
fig. 4(d) is a spectrum diagram of an intelligent interference pattern in an embodiment of the present invention;
FIG. 4(e) is a spectrum diagram of a random interference pattern in an embodiment of the present invention;
fig. 5(a) is a frequency spectrum diagram of an anti-interference model in a frequency sweep interference mode in the embodiment of the present invention;
FIG. 5(b) is a spectrum diagram of an interference rejection model in a comb interference mode according to an embodiment of the present invention;
FIG. 5(c) is a frequency spectrum diagram of an interference rejection model in a dynamic interference mode according to an embodiment of the present invention;
FIG. 5(d) is a spectrum diagram of an interference rejection model in the intelligent interference mode according to an embodiment of the present invention;
FIG. 5(e) is a spectrum diagram of an interference rejection model under a random interference mode in an embodiment of the present invention;
fig. 6 is a graph comparing normalized throughput of each user in dynamic interference mode according to the embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
The multi-user communication anti-interference intelligent decision method first constructs a multi-user wireless communication anti-interference system model. The base station uses the sensed current spectrum information of the multiple users and the jammer as the input of a deep reinforcement learning policy neural network (which realizes the selection of channel strategy actions), then selects a joint action (i.e., the multi-user communication channel selection) according to a dynamic greedy algorithm, and helps the users intelligently select communication frequency bands through base station feedback. Meanwhile, the immediate reward generated by the joint action of the current time slot is calculated, and the experience (the current spectrum selection state, the action, the immediate reward and the next spectrum selection information) is stored in an experience replay pool. When the number of experiences in the replay pool reaches a given value, a certain number of experiences are randomly drawn from the pool to update the parameters of the policy neural network, and the parameters of the target neural network are updated at fixed time intervals; this process is repeated so that finally each user can intelligently select an optimal communication frequency band, effectively cope with external malicious interference, and avoid internal mutual interference among users. The general flow chart is shown in fig. 1; the method specifically includes the following steps:
Step one: build the multi-user wireless communication anti-interference system model.
As shown in fig. 2, the multi-user wireless communication anti-interference system model of the present invention consists of several users, a base station and a jammer. The users, the base station and the jammer are randomly distributed in an open area and share a spectrum space. The users communicate with the base station respectively, and the user set is represented as:
\mathcal{N} = \{1, 2, \ldots, n\}    (1)
The spectrum space is divided into a number of channels, and the channel set is represented as:
\mathcal{M} = \{1, 2, \ldots, m\}    (2)
The number of available channels is m (n < m), and the bandwidth of each channel is b. The transmission power p_k of the k-th user is:
p_k = \int_{-b/2}^{b/2} U_k(f)\,df    (3)
wherein U_k(f) is the power spectral density of the signal transmitted by the k-th user, n is the number of users, and m is the number of channels.
The set of channels interfered by the jammer is:
J = \{1, \ldots, j\}    (4)
where j is the number of interfered channels.
When the channel being jammed is the same as the channel used for user communication, the jamming succeeds. If two or more users select the same frequency band, mutual interference results. In order to achieve reliable transmission, both external malicious interference and the mutual interference caused by contention between users need to be considered. The base station carries an agent with spectrum sensing, learning and decision-making capabilities; after each communication time slot, it defines the received spectrum information of the current time slot as the environment state, uses a deep reinforcement learning algorithm to make the anti-interference decision, and informs each user of the communication channel for the current time slot.
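For concreteness, a minimal sketch of how the system-model parameters of this step could be encoded in Python is given below; the class, field names and default values (taken from the embodiment in Example 1 where available) are illustrative assumptions rather than a normative part of the invention.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AntiJamSystemModel:
    """Parameters of the multi-user wireless anti-interference system model."""
    n_users: int = 2                 # number of users n (n < m)
    n_channels: int = 5              # number of available channels m
    channel_bw_hz: float = 4e6       # channel bandwidth b
    band_hz: float = 20e6            # total contested band B
    user_power_dbm: float = 0.0      # per-user transmit power
    jammer_power_dbm: float = 30.0   # jammer transmit power
    sinr_threshold_db: float = 10.0  # demodulation threshold beta_th
    hop_cost: float = 0.2            # frequency hopping cost c
    jammed_channels: List[int] = field(default_factory=list)  # set J

    def channel_center(self, k: int) -> float:
        """Center frequency f_k of channel k (k = 0..m-1), assuming the
        channels tile the band evenly from its lower edge."""
        return (k + 0.5) * self.band_hz / self.n_channels
```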
Step two: the SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}    (5)
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_j(f) the power spectral density of the jammer, f_k the center frequency of the channel k selected by user u, n(f) the power spectral density of the noise, f the signal frequency and f_l the center frequency of the channel l occupied by the jammer; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise.
Define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails. The normalized threshold g_u(f) is:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}    (6)
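As a hedged numerical illustration of step two, the sketch below evaluates SINR_u and the success indicator g_u under a simplifying assumption of flat (rectangular) power spectral densities over one channel band; the patent itself integrates raised-cosine PSDs over the band, so the function and all of its names are an approximation introduced only for illustration.

```python
import numpy as np

def sinr_and_success(u, freqs, gains, powers, jam_freq, jam_gain, jam_power,
                     noise_power, bandwidth, beta_th_db=10.0):
    """Approximate SINR_u for user u assuming flat PSDs over one channel band.

    freqs   : center frequencies f_v chosen by all users
    gains   : channel gains G_v from each user to the base station
    powers  : in-band received signal powers of each user (linear scale)
    jam_*   : jammer center frequency, channel gain and in-band power
    Returns (SINR_u in dB, g_u) with g_u = 1 on success, 0 on failure.
    """
    freqs = np.asarray(freqs, dtype=float)
    gains = np.asarray(gains, dtype=float)
    powers = np.asarray(powers, dtype=float)

    f_k = freqs[u]
    signal = gains[u] * powers[u]

    # co-channel interference: other users that picked the same channel
    same = (np.arange(len(freqs)) != u) & (freqs == f_k)
    co_channel = np.sum(gains[same] * powers[same])

    # jammer contributes only the fraction of its band overlapping channel k
    overlap = max(0.0, bandwidth - abs(jam_freq - f_k)) / bandwidth
    jamming = jam_gain * jam_power * overlap

    sinr_lin = signal / (co_channel + jamming + noise_power)
    sinr_db = 10.0 * np.log10(sinr_lin)
    g_u = 1 if sinr_db > beta_th_db else 0
    return sinr_db, g_u
```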
step three: defining discrete spectral sample values as
Figure BDA0003661635280000057
Wherein S (f) is ∑ k∈N G k U k (f-f k )+G j U j (f-f k ) + n (f) represents the user power spectral density received by the base station, Δ f is the resolution of the spectral analysis, and i is a positive integer and is used to represent the number of samples.
The result of the spectrum state observed by the base station each time is as follows:
o’ t =[o’ 1,t ,o’ 2,t …o’ L,t ] (8)
the environmental state is defined as:
O t =[o’ t ,o’ t+1 …o’ t+W-1 ] (9)
wherein W represents the history state number of backtracking, O t Is a two-dimensional matrix of size W × L, and O t The thermodynamic diagram of (1) is called a spectral waterfall and contains information in frequency domain and time domain.
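The W × L environment state of equations (7)-(9) can be maintained incrementally; the sketch below, with illustrative names and a simple buffer, shows one way to turn per-slot spectral observations o'_t into the waterfall matrix O_t.

```python
from collections import deque
import numpy as np

class SpectrumWaterfall:
    """Keeps the last W spectral observations and exposes them as O_t (W x L)."""

    def __init__(self, history_len: int, n_bins: int):
        self.W = history_len                     # number of traced-back slots
        self.L = n_bins                          # frequency bins per slot
        self.buf = deque(maxlen=history_len)

    def push(self, spectrum_slice: np.ndarray) -> None:
        """spectrum_slice is o'_t = [o'_{1,t}, ..., o'_{L,t}] for one slot."""
        assert spectrum_slice.shape == (self.L,)
        self.buf.append(spectrum_slice.astype(np.float32))

    def state(self) -> np.ndarray:
        """Return O_t; zero-padded until W observations have been collected."""
        rows = list(self.buf)
        while len(rows) < self.W:
            rows.insert(0, np.zeros(self.L, dtype=np.float32))
        return np.stack(rows, axis=0)            # shape (W, L)
```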
Step four: solve the anti-interference problem with a deep reinforcement learning algorithm, and take the spectrum waterfall O_t as the environment state at the current time t.
The action space is:
A = {a_1, a_2, …, a_{n×m}}    (10)
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action that the users can take at time t; there are n×m joint actions a(t) in total.
The joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]    (11)
where f_n(t) denotes the center frequency of the channel selected by user n at time t.
The state transition probability is expressed as:
P: (O_t, a) → O_{t+1}    (12)
which denotes the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t.
The immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]    (13)
where c is the frequency hopping cost.
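A small sketch of the per-slot joint reward follows, under the assumption (made when reconstructing equation (13), whose original image is not available) that the reward sums each user's success indicator g_u minus a hopping penalty c whenever that user changed channel; the function name and signature are illustrative.

```python
def joint_reward(success_flags, current_freqs, previous_freqs, hop_cost=0.2):
    """Immediate reward r(t) for the joint action of one time slot.

    success_flags : list of g_u values (1 = successful transmission, 0 = failed)
    current_freqs : channel center frequencies chosen at time t
    previous_freqs: channel center frequencies chosen at time t-1
    hop_cost      : frequency hopping cost c
    """
    reward = 0.0
    for g_u, f_now, f_prev in zip(success_flags, current_freqs, previous_freqs):
        hopped = f_now != f_prev
        reward += g_u - (hop_cost if hopped else 0.0)
    return reward
```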
Step five: because the environment state O_t evolves according to the unknown transition probability P(O_{t+1} | O_t, a(t)), which changes dynamically, and the state-action space of the anti-interference decision process is very large, a deep convolutional neural network (CNN) is used to approximate the Q function of each state-action pair (the expected discounted long-term reward of state O_t and action a(t)), namely:
Q(O_t, a(t)) = \mathbb{E}\left[r(t) + \gamma \max_{a'} Q(O_{t+1}, a')\right]    (14)
where r(t) is the immediate reward at the current time t, a' is the joint action taken by the user set when the Q value is maximal, O_{t+1} is the next state after the user set takes the joint action a(t) in state O_t, and γ is the discount factor.
As shown in FIG. 3, two convolutional neural networks are established: one is the policy neural network with weight parameter θ, and the other is the target neural network with weight parameter θ⁻; the weight parameters are initialized randomly. The two-dimensional spectrum waterfall O_t serves as the input of the neural network; the output after convolution is flattened into one-dimensional data by a flatten layer and then passes through four fully connected layers to obtain the final output value, so that the Q function can be represented by the nonlinear function of the neural network.
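A hedged PyTorch sketch of a network matching the description of FIG. 3 (convolutional feature extraction on the W × L waterfall, a flatten layer, then four fully connected layers outputting one Q value per joint action) is given below; the kernel sizes, channel counts, layer widths, activations and the default number of actions are not specified in the text and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Policy/target Q network: spectrum waterfall O_t -> Q values per joint action."""

    def __init__(self, history_len=200, n_bins=200, n_actions=10):
        super().__init__()
        self.conv = nn.Sequential(                       # convolutional feature extraction
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
        )
        with torch.no_grad():                            # infer flattened feature size
            dummy = torch.zeros(1, 1, history_len, n_bins)
            flat = self.conv(dummy).flatten(1).shape[1]
        self.fc = nn.Sequential(                         # four fully connected layers
            nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                   # one Q value per joint action
        )

    def forward(self, x):                                # x: (batch, 1, W, L)
        return self.fc(self.conv(x).flatten(1))
```

One instance of such a class would serve as the policy network (θ) and a second, periodically synchronized copy as the target network (θ⁻).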
The experience e_t = (O_t, a(t), r(t), O_{t+1}) of each time step t is stored in the data set D_t, and elements are randomly drawn from the uniform distribution e ~ U(D) to obtain the learning target value η_i:
\eta_i = r(t) + \gamma\, Q(O_{t+1}, a'; \theta_i^-)    (15)
e_t = (O_t, a(t), r(t), O_{t+1})    (16)
D_t = (e_1, \ldots, e_t)    (17)
where θ_i⁻ is the parameter of the target Q network at the i-th iteration, and a' is the action that maximizes the policy network Q_policy. When the input is O_{t+1}, the output of the target Q network gives η_i. Let the parameter of the policy Q network at the i-th iteration be θ_i; the mean square error between the target value and the actual output of the policy Q network is taken as the loss function:
L(\theta_i) = \mathbb{E}_{e\sim U(D)}\left[\big(\eta_i - Q(O_t, a; \theta_i)\big)^2\right]    (18)
The gradient of the loss function is:
\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{e\sim U(D)}\left[\big(\eta_i - Q(O_t, a; \theta_i)\big)\,\nabla_{\theta_i} Q(O_t, a; \theta_i)\right]    (19)
step six: in the training phase, according to the state O t Selecting the joint action a (t) by adopting a dynamic epsilon-greedy algorithm, namely randomly selecting the probability of the action a (t) to be epsilon at each iteration, and selecting the order strategy network Q policy Maximum action a ═ argmax a Q policy (O t ,a;θ i ) Has a probability of 1-epsilon, wherein
Figure BDA0003661635280000075
As an initial greedy probability, decay is the decay coefficient, i is the number of iterations, ε decreases exponentially with increasing number of iterations i, to preserveTo prove the exploratory nature of the algorithm, ε does not decay to 0. And sample e t =(O t ,a(t),r(t),O t+1 ) And storing the data into an experience playback pool D, and updating the experience playback pool with new samples according to a first-in first-out principle after the experience playback pool D is full.
After the number of elements in the empirical playback pool D is greater than one batch, Num samples e are randomly selected from the empirical playback pool D batch ={e k ,e k U (d), k 1,2, Num, the parameter θ of the policy network is performed by a gradient descent algorithm i Iteratively updating, for parameters of the target network
Figure BDA0003661635280000076
The parameters of the policy network are copied to update the parameters periodically (C times per iteration), and the training process is repeated until the maximum number of iterations is reached.
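To accompany step six, the following sketch shows a dynamic ε-greedy selector and a first-in first-out replay pool consistent with the description (ε = ε_0·e^(−decay·i), uniform sampling e ~ U(D)); the class names, default hyperparameters and the small floor on ε are illustrative assumptions.

```python
import math
import random
from collections import deque

class DynamicEpsilonGreedy:
    """epsilon = epsilon_0 * exp(-decay * i); eps_min keeps epsilon > 0 (assumed value)."""

    def __init__(self, eps0=1.0, decay=1e-3, eps_min=0.01):
        self.eps0, self.decay, self.eps_min = eps0, decay, eps_min

    def epsilon(self, iteration: int) -> float:
        return max(self.eps_min, self.eps0 * math.exp(-self.decay * iteration))

    def select(self, q_values, iteration: int) -> int:
        """q_values: 1-D sequence of Q(O_t, a; theta) over all joint actions."""
        if random.random() < self.epsilon(iteration):
            return random.randrange(len(q_values))                       # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

class ReplayPool:
    """FIFO experience replay pool D storing e_t = (O_t, a(t), r(t), O_{t+1})."""

    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)       # oldest samples dropped first

    def add(self, experience):
        self.pool.append(experience)

    def sample(self, num):
        return random.sample(list(self.pool), num)   # e ~ U(D), without replacement

    def __len__(self):
        return len(self.pool)
```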
After training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, the base station feedback helps each user select the optimal anti-interference communication frequency band, and the network parameters no longer need to be updated iteratively.
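As a concrete illustration of this inference step, the following is a minimal sketch (not the patent's actual implementation) of mapping the current spectrum waterfall O_t to a joint channel selection by taking the argmax over the trained policy-network outputs; the parameter names and the channel-center lookup are assumptions.

```python
import torch

def select_joint_action(policy_net, spectrum_waterfall, channel_centers):
    """Greedy inference: pick the joint action with the largest Q value.

    policy_net        : trained policy network mapping a (1, 1, W, L) tensor
                        to a vector of Q values, one per joint action
    spectrum_waterfall: 2-D numpy array O_t of shape (W, L)
    channel_centers   : list mapping each joint-action index to the tuple of
                        per-user channel center frequencies f_1(t)..f_n(t)
    """
    obs = torch.as_tensor(spectrum_waterfall, dtype=torch.float32)
    obs = obs.unsqueeze(0).unsqueeze(0)          # add batch and channel dimensions
    with torch.no_grad():
        q_values = policy_net(obs)               # shape (1, n_actions)
    best = int(q_values.argmax(dim=1).item())    # index of the maximum Q value
    return best, channel_centers[best]           # joint action a(t)
```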
The specific algorithm implementation process is presented as a pseudocode table (figure) in the original filing.
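Since that table is only available as an image, the following is a hedged Python/PyTorch sketch of a training loop matching steps one to six (dual networks, dynamic ε-greedy selection, replay sampling, gradient descent on the loss of equation (18), and target-network copies every C iterations). The environment object env and its reset/step interface, the hyperparameter values, and the helper classes QNetwork, DynamicEpsilonGreedy and ReplayPool from the earlier sketches are all illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def train_anti_jamming_dqn(env, n_actions, iterations=5000, batch_size=32,
                           gamma=0.9, lr=1e-4, target_copy_every=100):
    policy_net = QNetwork(n_actions=n_actions)           # weights theta
    target_net = QNetwork(n_actions=n_actions)           # weights theta^-
    target_net.load_state_dict(policy_net.state_dict())
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    explorer, pool = DynamicEpsilonGreedy(), ReplayPool()

    obs = env.reset()                                     # spectrum waterfall O_t
    for i in range(iterations):
        # 1. dynamic epsilon-greedy joint action selection
        with torch.no_grad():
            q = policy_net(torch.as_tensor(obs, dtype=torch.float32)[None, None])
        action = explorer.select(q.squeeze(0).tolist(), i)

        # 2. apply the joint action, observe reward and next spectrum waterfall
        next_obs, reward = env.step(action)
        pool.add((obs, action, reward, next_obs))
        obs = next_obs

        # 3. once enough experience is stored, update the policy network
        if len(pool) >= batch_size:
            batch = pool.sample(batch_size)
            o, a, r, o2 = zip(*batch)
            o = torch.as_tensor(np.array(o), dtype=torch.float32)[:, None]
            o2 = torch.as_tensor(np.array(o2), dtype=torch.float32)[:, None]
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)

            with torch.no_grad():                         # target eta_i, eq. (15)
                best_a = policy_net(o2).argmax(dim=1, keepdim=True)   # a' from Q_policy
                target = r + gamma * target_net(o2).gather(1, best_a).squeeze(1)
            q_sa = policy_net(o).gather(1, a[:, None]).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, target)   # loss of eq. (18)

            optimizer.zero_grad()
            loss.backward()                               # gradient step, eq. (19)
            optimizer.step()

        # 4. periodically copy the policy weights into the target network
        if (i + 1) % target_copy_every == 0:
            target_net.load_state_dict(policy_net.state_dict())

    return policy_net
```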
example 1
Embodiments of the invention are described in detail as follows: the system simulation uses the Python PyTorch framework; the system model contains two users, a base station and a jammer, and the two users and the jammer contend with each other within a frequency band of B = 20 MHz. The base station performs full-band sensing every 1 ms with Δf = 100 kHz and retains the spectral data of the last 200 ms, so the size of the matrix O_t is 200 × 200. The bandwidth of the user signal is 4 MHz, the center frequency is changed every 10 ms in 4 MHz steps, and the bandwidth of the interference signal is also 4 MHz. Both the user signal and the interference signal are raised-cosine waveforms with roll-off coefficient σ = 0.5; the signal power of each of user 1 and user 2 is 0 dBm, and the signal power of the jammer is 30 dBm. The demodulation threshold β_th at all frequencies is set to 10 dB, the channel gains are set to G_n = G_j, and the hopping cost of each user is set to c = 0.2.
In this embodiment, 5 interference modes are selected, specifically as follows:
(1) Sweep interference:
the center frequency of the interference signal is determined by the sweep rate v and the time t:
f_j(t) = (v \cdot t) \,\%\, B
where "%" is the remainder (modulo) operator; the sweep rate v is 0.6 GHz/s.
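A one-line sketch of this swept-frequency jammer follows, under the assumption (used when reconstructing the formula above) that the modulo wraps the linearly increasing frequency back into the 20 MHz band; the function name and defaults are illustrative.

```python
def sweep_center_freq(t_s: float, sweep_rate_hz_per_s: float = 0.6e9,
                      band_hz: float = 20e6) -> float:
    """Center frequency of the swept jammer at time t_s (in seconds)."""
    return (sweep_rate_hz_per_s * t_s) % band_hz
```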
(2) Comb interference: the center frequencies of the interference signals are fixed at 6MHz and 14 MHz.
(3) Dynamic interference: the comb or swept interference pattern is randomly selected and remains constant for a certain time (100 ms).
(4) Intelligent interference: by counting the probability of each user action over the past 100 ms, the two center frequencies with the highest action probability are selected as the comb interference frequencies.
(5) Random interference: two of 2MHz, 6MHz, 10MHz, 14MHz and 18MHz are randomly selected as the center frequencies of the comb interference and are kept unchanged for a certain time (50 ms).
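For illustration only, a compact sketch of how the interference modes listed above could be driven in a simulation; the mode names, timings and candidate tones follow the embodiment, while the class itself, its interface and its reliance on the sweep_center_freq helper from the earlier sketch are assumptions (the "intelligent" mode is omitted because it additionally needs the users' recent action statistics).

```python
import random

class Jammer:
    """Sketch of the jamming modes; returns center frequencies (Hz) per time step."""

    COMB = (6e6, 14e6)                            # fixed comb tones of mode (2)
    RANDOM_POOL = (2e6, 6e6, 10e6, 14e6, 18e6)    # candidate tones of mode (5)

    def __init__(self, mode: str):
        self.mode = mode
        self._dyn_period = -1                     # last 100 ms period seen by "dynamic"
        self._dyn_is_comb = True
        self._rand_period = -1                    # last 50 ms period seen by "random"
        self._rand_tones = random.sample(self.RANDOM_POOL, 2)

    def centers(self, t_ms: float):
        if self.mode == "sweep":
            return (sweep_center_freq(t_ms * 1e-3),)
        if self.mode == "comb":
            return self.COMB
        if self.mode == "dynamic":                # randomly re-pick comb or sweep every 100 ms
            p = int(t_ms // 100)
            if p != self._dyn_period:
                self._dyn_period, self._dyn_is_comb = p, random.random() < 0.5
            return self.COMB if self._dyn_is_comb else (sweep_center_freq(t_ms * 1e-3),)
        if self.mode == "random":                 # two new comb tones every 50 ms
            p = int(t_ms // 50)
            if p != self._rand_period:
                self._rand_period = p
                self._rand_tones = random.sample(self.RANDOM_POOL, 2)
            return tuple(self._rand_tones)
        raise ValueError("the 'intelligent' mode needs the users' recent action statistics")
```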
FIGS. 4(a), 4(b), 4(c), 4(d) and 4(e) are spectrum waterfall diagrams of the 5 interference patterns in the embodiment of the present invention, in which the abscissa represents frequency (in units of 10^5 Hz), the ordinate represents time (in ms), and the shades of color represent power (in dBm). In the spectrum waterfall, the light squares represent the communication signals transmitted by the users, the dark diagonal lines and squares represent the interference signals emitted by the jammer, and the black background represents the noise.
Fig. 5(a), 5(b), 5(c), 5(d), and 5(e) are spectrograms of anti-interference models in 5 interference modes in the embodiment of the present invention, and it can be seen that after iterative training, the agent can learn an interference strategy of the jammer, help the user avoid an interference signal of the jammer, and simultaneously consider mutual interference caused by contention among users, so as to achieve effective anti-interference.
Fig. 6 is a comparison graph of normalized throughput of each user in the dynamic interference mode in the embodiment of the present invention, and it can be seen from the graph that as the number of iterations increases, the normalized throughput (probability of successful communication in unit time) of each user gradually increases and then tends to converge, compared with a deep reinforcement learning algorithm with random frequency hopping and a fixed epsilon value, the normalized throughput of each user after convergence of the present invention is superior to that of the other two algorithms, and after 4000 iterations, the normalized throughput can reach 0.94 or more, which proves that the algorithm provided by the present invention has a better anti-interference effect.
In summary, the multi-user communication anti-interference intelligent decision method provided by the invention adopts a dynamic ε-greedy strategy, which improves the learning rate, and can effectively cope with external malicious interference and the mutual interference caused by competition among users. It is not limited to randomly selecting a communication frequency band through frequency hopping; instead, it helps each user automatically select the optimal communication frequency band, i.e. the band least likely to be interfered, according to the current spectrum state.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (6)

1. A multi-user communication anti-interference intelligent decision method based on deep reinforcement learning, characterized by comprising the following steps:
S1, constructing a multi-user wireless communication anti-interference system model composed of a plurality of users, a base station and a jammer, wherein the users, the base station and the jammer are randomly distributed in an open area and share a spectrum space;
S2, the base station acquires the sensed current spectrum information of the multiple users and the jammer;
S3, establishing two convolutional neural network models, taking the current spectrum information as the input of the convolutional neural network models, then selecting a joint action according to a dynamic greedy algorithm, and helping the users intelligently select communication frequency bands through base station feedback;
S4, calculating the immediate reward generated by the joint action of the current time slot, and storing the experience in an experience replay pool, where the experience comprises the current spectrum selection state, the joint action, the immediate reward and the next spectrum selection information;
S5, when the number of experiences in the experience replay pool reaches a given value, randomly extracting a certain number of experiences from the pool to update the parameters of the policy neural network, and updating the parameters of the target neural network once at fixed time intervals; the iteration stops when the set number of iterations is reached.
2. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 1, wherein in step S2 the base station judges whether the communication of user u is successful according to the received SINR_u of user u; if the communication is successful, the normalized threshold g_u(f) is 1, otherwise it is 0;
the SINR_u of user u received by the base station is:
SINR_u = \frac{G_u \int_{f_k-b/2}^{f_k+b/2} U_u(f-f_k)\,df}{\int_{f_k-b/2}^{f_k+b/2}\left[G_j U_j(f-f_l)+n(f)\right]df + \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int_{f_k-b/2}^{f_k+b/2}U_v(f-f_k)\,df}
wherein G_u represents the channel gain from user u to the base station, G_j the channel gain from the jammer to the base station, U_j(f) the power spectral density of the jammer, f the signal frequency, f_k the center frequency of the channel k selected by user u, f_l the center frequency of the channel l occupied by the jammer, and n(f) the power spectral density of the noise; the summation term \sum_{v\in\mathcal{N},\,v\neq u}\delta(f_v=f_k)\,G_v\int U_v(f-f_k)\,df represents the co-channel interference from the other users in the user set when user u selects channel k; δ(ε) is an indicator function, which is 1 if ε is true and 0 otherwise; \mathcal{N} represents the set of users;
define β_th as the SINR threshold for successful transmission: when the received SINR_u of user u is greater than β_th, the transmission succeeds; when the received SINR_u of user u is less than or equal to β_th, the transmission fails; the normalized threshold g_u(f) is then:
g_u(f) = \begin{cases} 1, & \mathrm{SINR}_u > \beta_{th} \\ 0, & \mathrm{SINR}_u \le \beta_{th} \end{cases}
3. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 1, wherein in step S3 one of the two convolutional neural networks is a policy neural network with weight parameter θ and the other is a target neural network with weight parameter θ⁻, and the weight parameters are initialized randomly; the two-dimensional spectrum waterfall O_t serves as the input of the neural network, the output after convolution is flattened into one-dimensional data by a flatten layer, and the final output value is obtained after four fully connected layers;
the joint action a(t) is selected by a dynamic ε-greedy algorithm as follows:
at each iteration, an action a(t) is selected at random with probability ε, and the action a* = argmax_a Q_policy(O_t, a; θ_i) that maximizes the policy network Q_policy is selected with probability 1 − ε, where
\varepsilon = \varepsilon_0 \cdot e^{-\mathrm{decay}\cdot i}
ε_0 is the initial greedy probability, decay is the decay coefficient, i is the iteration number, ε decreases exponentially as the iteration number i increases, and e is the natural constant.
4. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein the spectrum waterfall O_t is obtained as follows:
the discrete spectral sample values are defined as:
o'_{i,t} = \int_{(i-1)\Delta f}^{i\Delta f} S(f)\,df, \quad i = 1, 2, \ldots, L
wherein S(f) represents the power spectral density received by the base station, Δf is the resolution of the spectral analysis, and i is a positive integer indexing the samples;
the spectrum state observed by the base station at each time slot is:
o'_t = [o'_{1,t}, o'_{2,t}, …, o'_{L,t}]
the spectrum waterfall O_t is then defined as:
O_t = [o'_t, o'_{t+1}, …, o'_{t+W-1}]
wherein W represents the number of traced-back historical states; O_t is a two-dimensional matrix of size W × L containing both frequency-domain and time-domain information.
5. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein in step S4 the immediate reward generated by the joint action of the current time slot is calculated as follows:
the action space is represented as:
A = {a_1, a_2, …, a_{n×m}}
where n is the number of users, m is the number of channels, and a_q (q = 1, 2, …, n×m) denotes a joint action taken by the users at time t; there are n×m joint actions a(t) in total;
the joint action a(t) at time t is:
a(t) = [f_1(t), f_2(t), …, f_n(t)]
where f_n(t) denotes the center frequency of the channel selected by user n at time t;
the state transition probability is the probability that the user set transitions to state O_{t+1} after executing the joint action a(t) in state O_t, expressed as:
P: (O_t, a) → O_{t+1}
the immediate reward r(t) is defined as:
r(t) = \sum_{u=1}^{n}\left[g_u(f) - c\cdot\delta\big(f_u(t)\neq f_u(t-1)\big)\right]
where c is the frequency hopping cost.
6. The multi-user communication anti-interference intelligent decision method based on deep reinforcement learning according to claim 3, wherein in step S5, after the number of elements in the experience replay pool D exceeds one batch, Num samples e_batch = {e_k}, e_k ~ U(D), k = 1, 2, …, Num are randomly drawn from the pool D, and the parameter θ_i of the policy network is iteratively updated by a gradient descent algorithm; the parameters θ⁻ of the target network are updated by periodically copying the parameters of the policy network;
after training is finished, the environment state O_t is fed into the policy network to compute the outputs Q(O_t, a; θ), where a denotes a joint action taken by the users and θ denotes the weights of the policy network; the action corresponding to the maximum Q value is selected, and the base station feedback helps each user select the optimal anti-interference communication frequency band.