CN117715054A - Multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning - Google Patents


Info

Publication number
CN117715054A
CN117715054A (application CN202311658293.8A)
Authority
CN
China
Prior art keywords
user
satellite
channel
interference
access
Prior art date
Legal status: Pending
Application number
CN202311658293.8A
Other languages
Chinese (zh)
Inventor
王洪圆
潘健雄
欧阳巧琳
王培森
齐斌
许鲁彦
叶能
Current Assignee
32802 Troops Of People's Liberation Army Of China
Beijing Institute of Technology BIT
Original Assignee
32802 Troops Of People's Liberation Army Of China
Beijing Institute of Technology BIT
Priority date: 2023-12-05
Filing date: 2023-12-05
Publication date: 2024-03-15
Application filed by 32802 Troops Of People's Liberation Army Of China and Beijing Institute of Technology BIT
Priority to CN202311658293.8A
Publication of CN117715054A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

A multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, belonging to the field of satellite communication. A partially connected neural network is constructed using the Actor-Critic off-policy learning method of deep reinforcement learning, and a target network is used to soft-update the neural network parameters, which improves decision performance during the countermeasure process and adapts better to changes in the electromagnetic environment. In the environment modeling and in the state modeling of reinforcement learning, the action of the previous moment is fused into the state, and, combined with the judgment of the reward, different actions are output in consecutive time slots, so that intelligent access becomes more flexible and variable and the anti-interference capability of access is improved. By using a GPU computing network and an off-policy reinforcement learning method, sample collection, training and effective intelligent access can be carried out even when training samples and prior data are lacking. The invention is applicable to the field of satellite communication and improves the anti-interference capability while ensuring user access accuracy.

Description

Multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning
Technical Field
The invention relates to a multi-body collaborative satellite access and anti-interference method based on deep reinforcement learning, belonging to the field of satellite communication.
Background
Satellite network mass access refers to providing high-speed internet access services to users over a wide area through satellite technology. The concept arose from the urgent need for worldwide internet coverage, particularly in areas where geographic conditions are severe and infrastructure is relatively weak. Developing satellite communication systems is an important measure for seizing the high ground of space information network development and realizing the goal of becoming a network power; it can promote the industrialization of industry services such as navigation augmentation, wide-area monitoring, and data collection and distribution, and is an important means of driving the overall development of commercial aerospace and leading the upgrading of the information industry and aerospace technology.
However, a first difficulty with large-scale access to satellite networks is that current random access protocols perform poorly in ultra-dense networks, so efficient access schemes that can handle large numbers of requests are required. Another challenge is interference attacks. To address access congestion, enhanced random access schemes include priority-based, packet-based and code-spread random access, and some studies also consider coded random access and sparse code multiple access. However, these schemes require a centralized scheduling mechanism, which is not available in wide-area satellite access scenarios because of the large propagation delays and the large number of users. To resist interference attacks, common technical means include direct-sequence spread spectrum and frequency-hopping spread spectrum, as well as multi-beam antennas and adaptive anti-interference routing.
However, most existing work focuses on random access mechanisms that increase the success rate, and only some of it also achieves interference immunity. Satellites are vulnerable to interference attacks because of their high openness. In a malicious interference environment, a jammer reduces the channel quality by transmitting an interference signal, causing access failures. In addition, when a device cannot access the network it may continually attempt retransmission, leading to rapid battery discharge and increased channel blockage. Thus, an advanced random access scheme is needed to support large-scale operation of satellite networks under interference attacks. Conventional anti-interference methods cannot cope with intelligent interference that adjusts its interference strategy according to user behavior.
Therefore, for the problem of multi-body cooperative intelligent satellite access and anti-interference in a complex electromagnetic environment, where traditional access and anti-interference methods serve only as a reference and intelligent jamming with a continuously changing strategy must be faced, combining a deep reinforcement learning algorithm that continuously observes changes in the electromagnetic environment and learns the transformation rules of the interference can further improve multi-satellite access efficiency.
Disclosure of Invention
Aiming at the problems of insufficient anti-interference capability and poor environmental adaptability of existing satellite access technology, the main purpose of the invention is to provide a multi-body collaborative satellite access and anti-interference method based on deep reinforcement learning. The method adopts the Actor-Critic algorithm of deep reinforcement learning to combine a deep neural network with traditional Q-learning, sets the two modes of traditional artificial interference and intelligent interference in a satellite-ground cooperative access environment, and takes into account both transmission delay and the allocation of transmission power. Under the condition of ensuring user access accuracy, the anti-interference capability is improved.
The invention aims at realizing the following technical scheme:
The invention discloses a multi-agent cooperative electromagnetic anti-interference method based on deep reinforcement learning. First, a complex electromagnetic environment for multi-agent reinforcement learning is built, which includes transmission delay, signal fading and noise interference; the transmission channel is a time-varying channel with Markov properties. A partially connected neural network is adopted so that the two actions of channel selection and power allocation can be output simultaneously; a denser reward scheme is adopted to judge whether an action is good; an agent cannot select the same channel in consecutive time slots, which increases the variability of its decisions; and through multiple rounds of iteration, the access capability and the anti-interference capability are continuously improved.
The invention discloses a multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, which comprises the following steps:
step one: constructing a complex electromagnetic environment of multiple intelligent agents;
an aerospace integrated network is built, and N intelligent users transmit information to satellites in the network, wherein M satellites are used for receiving the information, and 1 traditional jammer and 1 intelligent jammer are arranged in the network. The two jammers have the same opportunity to adopt a limited power access channel, and if a plurality of jammers and a user select the same channel in the same time slot, the user fails to transmit, and the jammers successfully interfere; otherwise, the user transmission is successful and the interference is avoided successfully. Furthermore, the interference path of the jammer is partially observable.
Step two: in the electromagnetic environment in the first step, a Cartesian space three-dimensional coordinate system and an Actor-Critic neural network are built;
the large-scale fading is modeled based on a processable line of sight, i.e., loS, probabilistic model with shadow and occlusion effects. In the LoS probability model, large-scale fading follows the generalized bernoulli distribution of two different events; the channel is LoS or non-LoS (NLoS) with a certain probability. Because of the satellite access model, only LoS channels are considered, so the large scale fading between satellite m and user n is expressed as:
the large scale fading between satellite m and jammer j is expressed as:
wherein beta is 0 Is the reference distance d 0 The average power gain at=1, l is the vector of the three-dimensional space coordinate system,the position vectors of the satellite, the user and the jammer are respectively represented, and alpha is a path loss index.
The channel gain between satellite m and user n is expressed as:
the channel gain between satellite m and jammer j is expressed as:
wherein the method comprises the steps ofAnd->Is the effect of small scale fading at time t, following the rice (Rician) distribution.
The transmission powers of the jammer and of user n on the kth channel of the mth satellite are p_j^{m,k}(t) and p_n(t) respectively. Thus, the channel capacity of user n on the kth channel of the mth satellite is:
C_{n,m,k}(t) = W · log2( 1 + p_n(t) · g_{m,n}(t) / ( σ² + p_j^{m,k}(t) · g_{m,j}(t) ) )
where W is the bandwidth of the channel and σ² is the Gaussian noise power. The transmission powers of the user and the jammer satisfy Σ_n p_n(t) ≤ P_tot, p_n(t) ≤ p_n^max and p_j(t) ≤ p_j^max, where P_tot is the maximum sum of power that the users are allowed to use in that time slot, and p_n^max and p_j^max are the maximum powers that can be used by the user and by the adversary jammer, respectively.
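As a numerical illustration of the channel model above, the following sketch (assuming NumPy, a unit-reference-distance path-loss form and a standard Rician small-scale model; the concrete parameter values are assumptions) computes the channel gain and capacity for one user:

```python
# Sketch of the LoS large-scale fading, Rician small-scale fading and channel capacity.
import numpy as np

rng = np.random.default_rng(0)

def large_scale_fading(pos_a, pos_b, beta_0=1.0, alpha=2.0):
    """beta_0: average power gain at reference distance d_0 = 1; alpha: path-loss exponent."""
    d = np.linalg.norm(np.asarray(pos_a, float) - np.asarray(pos_b, float))
    return beta_0 * d ** (-alpha)

def rician_small_scale(k_factor=10.0):
    """One Rician fading sample: dominant LoS component plus scattered Gaussian components."""
    los = np.sqrt(k_factor / (k_factor + 1))
    scatter = np.sqrt(1 / (2 * (k_factor + 1))) * (rng.standard_normal() + 1j * rng.standard_normal())
    return los + scatter

def channel_capacity(p_user, g_user, p_jam, g_jam, W=1e6, noise_power=1e-13):
    """Capacity of user n on channel k of satellite m, treating the jammer signal as noise."""
    sinr = p_user * g_user / (noise_power + p_jam * g_jam)
    return W * np.log2(1 + sinr)

beta = large_scale_fading([0.0, 0.0, 550e3], [100.0, 200.0, 0.0])   # satellite ~550 km overhead
g_user = np.abs(rician_small_scale()) ** 2 * beta
g_jam = 0.1 * g_user                                                # weaker jammer-to-satellite link
print(channel_capacity(p_user=2.0, g_user=g_user, p_jam=1.0, g_jam=g_jam))
```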
Step three: according to the position coordinates of the three-dimensional coordinate system in step two, the acquired spectrum information is converted into data to obtain the input state of the neural network;
to ensure the stability of user access, the spectrum occupation condition of continuous time slots needs to be obtained. The available unoccupied channels are denoted as 1 and the unavailable occupied channels are denoted as-1. And using b as input to the neural network the spectrum occupancy of consecutive 9 and time slots m (t) represents the observed channel conditions, b m,k (t) =1 indicates that there is a successful access to the satellite by the user and that b m,k (t) = -1 indicates that there is a user accessing the satellite but fails, b m,k (t) =0 means that no user has access to the satellite.
Wherein,indicating that user n has access to satellite m at time tThe kth channel, u, represents the occupancy of the user. F (F) n (t) two cases indicating that the user failed to access the satellite may be 0 and 1.
Definition b m (t)=[b m,1 (t),b m,2 (t)····b m,k (t)],B(t)=[b 1 (t),b 2 (t)····b M (t)] T To represent the occupancy of the channel.
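A minimal sketch of this state construction (assuming the 16-channel, 9-slot observation window used in the embodiment; function and variable names are assumptions) could be:

```python
# Sketch of the spectrum-occupancy state: entries are 1 (successful access),
# -1 (access attempted but failed) and 0 (no access observed yet).
import numpy as np

NUM_CHANNELS, NUM_SLOTS = 16, 9

def update_state(state, newest_slot):
    """state: (NUM_CHANNELS, NUM_SLOTS) history of the last 9 slots.
    newest_slot: length-NUM_CHANNELS vector with values in {1, -1, 0} for the current slot.
    The oldest slot is dropped and the newest appended."""
    newest = np.asarray(newest_slot).reshape(-1, 1)
    return np.concatenate([state[:, 1:], newest], axis=1)

state = np.zeros((NUM_CHANNELS, NUM_SLOTS))
current = np.zeros(NUM_CHANNELS)
current[3], current[7] = 1, -1          # channel 3 accessed successfully, channel 7 failed
state = update_state(state, current)
print(state.shape)                      # (16, 9), flattened before feeding the Actor/Critic
```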
Step four: after the state input from step three is obtained, it is fed into the Actor network and the Critic network respectively, and the agent performs the two actions of selecting a suitable channel and power to resist the jammer;
the optimized target gradients of the two neural networks at this time are as follows:
where ω is a value parameter, θ is a network parameter, qω (s t ,a t ) The current action Q value. Pi θ (a t |s t ) For the current moment policy E π Is a policy expectation.
The mean square error loss function is adopted:
where r is the reward for the action, gamma is the decay factor, V w (s t ) Representing the state cost function at the current time, V (s t+1 ) Representing the state cost function at the next moment.
The Actor network selects a proper channel from the action space of the user, and simultaneously selects the used transmission power in the power action space, and the process is output action At. Critic network calculates a next time state V value V (s t+1 ) Outputting the time division error to judge the quality of the action, wherein the time division error is calculated in the following way:
TD-Error=Q(s,a)-V w (s t )
=r+gamma*V(s t+1 )-V w (s t ) (10)
wherein Q (s, a) is the Q value at the current moment.
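A compact sketch of the TD error and the two losses defined above (TensorFlow 2.x is assumed, matching the framework used in the embodiment; function names are assumptions):

```python
# Sketch of equation (10) and of the Critic/Actor training targets.
import tensorflow as tf

GAMMA = 0.95   # decay factor

def td_error(reward, v_s, v_s_next):
    """TD-Error = r + gamma * V(s_{t+1}) - V(s_t); v_s and v_s_next are Critic outputs."""
    return reward + GAMMA * v_s_next - v_s

def critic_loss(reward, v_s, v_s_next):
    """Mean square error loss on the TD target, used to train the Critic."""
    return tf.reduce_mean(tf.square(td_error(reward, v_s, tf.stop_gradient(v_s_next))))

def actor_loss(log_prob_action, td_err):
    """Policy-gradient loss: -log pi_theta(a|s) weighted by the (fixed) TD error."""
    return -tf.reduce_mean(log_prob_action * tf.stop_gradient(td_err))
```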
Step five: the state from step three and the action from step four are input into the environment to interact with it and obtain the reward fed back by the environment;
the following conditions can occur in the interaction process of the actions of the intelligent body and the actions of the jammer with the environment, wherein the maximum rewards of each step are 3, and the minimum rewards are-1. The following C n,m,k (t) represents the transmission rate of the kth channel of the satellite m selected by the user n at the current moment in time t, C threshold Indicating the rate threshold for successful transmission.
(1) Selecting the correct channel correct power
In the case, the intelligent user selects a channel which is not occupied by other users, and the interference of an enemy jammer is avoided correctly, at the moment, the intelligent user obtains a reward r=3, and if a plurality of intelligent users select the same channel, the reward is returned to r=2; if interference is received but the power is sufficient to reach the transmission rate, C n,m,k (t)>C threshold At this point r=2, if multiple smart users select the same channel, the reward is back to r=1.
(2) Selecting the correct channel error power
In this case, the intelligent user selects a channel which is not occupied by other users, but is interfered by the adversary, and the power is not enough to reach the transmission rate, C n,m,k (t)<C threshold At this time, the prize is r=1, and if multiple smart users select the same channel, the prize is r=0
(3) Selecting the wrong channel
In this case, the intelligent user selects a channel identical to the other users, and when the intelligent user is not interfered by the hostile jammer, the reward is r=0, and if the intelligent user is interfered by the jammer, r= -1.
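The three cases above can be condensed into a small reward function; the sketch below assumes the distinction made in the text between background "other users" and the cooperating intelligent users, and all names are illustrative:

```python
# Sketch of the per-step reward in step five: maximum 3, minimum -1.
def step_reward(channel_taken_by_others, jammed, rate, rate_threshold, shared_with_agents):
    """channel_taken_by_others: the chosen channel is already occupied by other users (case 3).
    jammed: an enemy jammer attacked the chosen channel.
    rate / rate_threshold: achieved transmission rate vs. C_threshold.
    shared_with_agents: several intelligent users picked this channel."""
    if channel_taken_by_others:                  # case (3): wrong channel
        return -1 if jammed else 0
    if not jammed:                               # case (1): correct channel, interference avoided
        return 2 if shared_with_agents else 3
    if rate > rate_threshold:                    # case (1): jammed, but the power still suffices
        return 1 if shared_with_agents else 2
    return 0 if shared_with_agents else 1        # case (2): jammed and power insufficient
```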
Step six: the state from step three, the action from step four and the reward from step five are input together into the Actor and Critic networks for experience learning, the network parameters are optimized and updated, and the satellite access and anti-interference effect is achieved.
First the Critic network is trained: the next-moment state s_{t+1} is obtained through environmental observation, the V value of the next-moment state is obtained through network calculation, and the Bellman equation is computed:
TD-Error = Q(s, a) − V(s) = r + γ · V(s′) − V(s)    (11)
After the temporal-difference error TD-Error is calculated, it is sent to the Actor network together with the state S and the action A for learning; TD-Error is the weight used when the Actor is updated. The Critic network does not need to estimate Q but only V. The TD-Error, i.e. the advantage function, can then be calculated, minimized, and its expectation taken. The policy gradient that needs to be calculated at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
The state value function V can naturally be used as the baseline, π_θ is the current policy and γ is the decay factor, and an update weight is obtained, where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s).
Q_{π,γ}(s, a) is the Q value under the decay factor and V_{π,γ}(s) is the state value function under the decay factor.
The reward value can be used as the decision direction for training the agent through the loss function: the larger the reward value, the smaller the difference from the expected Q value and the better the training effect; the smaller the reward value, the larger the loss value, indicating a poor action selection. If the reward value over consecutive time slots is very small or negative, other actions are explored to update the policy gradient and find a better policy that improves the anti-interference capability of the agent, until the best policy is found, the reward fed back to the agent by the environment approaches the maximum reward value, and the policy converges stably, i.e. satellite access and anti-interference are achieved.
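A sketch of one joint Critic/Actor update following step six (TensorFlow 2.x assumed; the Actor is assumed here to output unnormalized logits over a single discrete action space, and all function names are illustrative):

```python
# Sketch of a single training step: the Critic learns V(s) from the TD target and the
# TD error weights the Actor's policy-gradient update, as described in step six.
import tensorflow as tf

def train_step(actor, critic, actor_opt, critic_opt, state, action, reward, next_state, gamma=0.95):
    with tf.GradientTape(persistent=True) as tape:
        v_s = tf.squeeze(critic(state), axis=-1)                  # V(s_t)
        v_next = tf.squeeze(critic(next_state), axis=-1)          # V(s_{t+1})
        td_err = reward + gamma * tf.stop_gradient(v_next) - v_s  # equation (11)
        critic_loss = tf.reduce_mean(tf.square(td_err))

        logits = actor(state)                                     # policy logits over actions
        log_prob = tf.nn.log_softmax(logits)
        chosen = tf.reduce_sum(log_prob * tf.one_hot(action, logits.shape[-1]), axis=-1)
        actor_loss = -tf.reduce_mean(chosen * tf.stop_gradient(td_err))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    del tape
    return float(critic_loss), float(actor_loss)
```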
The beneficial effects are that:
1. The invention discloses a multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning that uses the Actor-Critic off-policy learning algorithm of deep reinforcement learning. A partially connected neural network is constructed and each network branch is trained with its own parameters, which gives the method an advantage in resource allocation over other intelligent access technologies. In addition, a target network is used for soft updating of the neural network parameters, which improves decision performance in the countermeasure process and adapts better to changes in the electromagnetic environment.
2. In the multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning, in the environment modeling and in the state modeling of reinforcement learning, not only is the continuous spectrum information taken as the state input, but the action of the previous moment is also fused into the state; combined with the judgment of the reward, different actions can then be output in consecutive time slots, so that intelligent access has flexibility and variability and its strategy is difficult for intelligent interference to learn. While ensuring access to its own channel resources, the anti-interference capability of access is improved.
3. The multi-body collaborative electromagnetic anti-interference method based on deep reinforcement learning disclosed by the invention uses a GPU computing network and an off-policy reinforcement learning algorithm, can perform sample-collection training and effective intelligent access even when training samples and prior data are lacking, and simulates more complex and realistic electromagnetic countermeasure environments.
Drawings
FIG. 1 is a schematic flow chart of the multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed by the invention;
FIG. 2 is the partially connected neural network constructed in this embodiment;
FIG. 3 is a schematic diagram of random access in a satellite network under a malicious interference attack in this embodiment;
FIG. 4 is an access performance diagram against the adversary's conventional jammer in this embodiment;
FIG. 5 is an access performance diagram against the adversary's intelligent jammer in this embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples, together with the technical problems solved and the beneficial effects of the technical solution; the described embodiment is only intended to facilitate understanding of the invention and has no limiting effect.
The multi-agent cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed in this embodiment is implemented on the TensorFlow 2.0 and TensorLayer 2.0 frameworks. It adopts the deep reinforcement learning Actor-Critic algorithm, replaces the traditional Q-table calculation with a partially connected neural network, bridges the processed spectrum information with the electromagnetic environment observation based on the optimization model, sends the bridged state into the networks to obtain the output action and the action's Q value respectively, and finally performs experience training on the state, action, reward and Q value of the whole process to update the parameters. The specific parameter settings are as follows:
attenuation factor gamma 0.95
Actor learning rate 0.0001
Critic learning rate 0.001
Number of channels 16
Training batch 300
Every round of time slots 109
Greedy factor epsilon 0.15
Optimizer Adam
As shown in fig. 1, the method comprises the following steps:
step 101: and establishing an Actor part of an Actor-Critic algorithm to be connected with the neural network.
A 3-layer partially connected Actor neural network is built. The first layer is a public (shared) layer containing 256 neurons. The second layer is a normalization layer (Batch Normalization), which improves the convergence speed of the neural network and the generalization capability of the network to prevent overfitting. The normalization takes the form:
μ_B = (1/m) Σ_i x_i,   σ_B² = (1/m) Σ_i (x_i − μ_B)²,   x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε),
y_i ← γ · x̂_i + β = BN_{γ,β}(x_i)    (16)
where m is the number of samples, x_i is a sample value, μ_B is the sample mean, σ_B² is the sample variance, γ and β are the introduced scale and shift factors respectively, ε is a manually set constant that prevents the denominator from being 0, BN_{γ,β}(x_i) is the sample data after passing through the BN layer, and y_i is the corresponding output.
The procedure is: first obtain the m sample values x_i, calculate the sample mean μ_B, then calculate the sample variance σ_B² and standardize the sample data; the two parameters γ (scale factor) and β (shift factor) are introduced and trained, so that the network can learn to recover the feature distribution that the original network needs to learn, and the normalized sample value x̂_i is computed. Normalization accelerates the convergence of the network, and using the batch mean and variance can be regarded as introducing noise, i.e. it helps prevent overfitting.
The third layer consists of 2 partially connected branches, each containing 128 neurons. The network structure is shown in fig. 2. The first branch outputs the selected action in the channel-selection action space, and the second branch outputs the power-allocation action in the power action space. Since the purposes of the two actions differ and the directions of the calculated gradients differ, two different sets of network parameters need to be trained separately.
Furthermore, both branches use the Adam optimizer with the same learning rate of 0.001.
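A sketch of this partially connected Actor in tf.keras (the layer sizes follow the text; the number of discrete power levels, the activations and all identifiers are assumptions):

```python
# Sketch of the 3-layer partially connected Actor: shared 256-neuron layer, a Batch
# Normalization layer, then two 128-neuron branches for channel selection and power allocation.
import tensorflow as tf

def build_actor(state_dim=16 * 9, num_channels=16, num_power_levels=4):
    inputs = tf.keras.Input(shape=(state_dim,))
    shared = tf.keras.layers.Dense(256, activation="relu")(inputs)       # public layer
    shared = tf.keras.layers.BatchNormalization()(shared)                # normalization layer
    ch = tf.keras.layers.Dense(128, activation="relu")(shared)           # branch 1: channel
    ch_out = tf.keras.layers.Dense(num_channels, activation="softmax", name="channel")(ch)
    pw = tf.keras.layers.Dense(128, activation="relu")(shared)           # branch 2: power
    pw_out = tf.keras.layers.Dense(num_power_levels, activation="softmax", name="power")(pw)
    return tf.keras.Model(inputs, [ch_out, pw_out])

actor = build_actor()
actor_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)                 # Adam, as stated above
```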
Step 102: Construct the fully connected Critic network.
The Critic part of the algorithm is similar to the Actor part: 256 neurons in the first layer, followed by the BN layer and then the output layer. However, the final output of the network has only one neuron, whose output is used to form the temporal-difference error (TD-Error), which judges whether the action output by the Actor network is good or bad.
The calculation form of the TD error is as follows:
TD-Error = r + γ · V(s′) − V(s)    (17)
similarly, adam optimizer is selected for use with a learning rate of 0.01
Step 103: Configure the other parameters required for training the network: the learning rate, batch size, weight initialization method, weight decay coefficient, optimization method, number of iterations, round size and decay factor.
Step 104: After initializing the network parameters and states, at the beginning of each round, each time slot observes the spectrum occupancy in the environment: an available channel is marked as 1, an unavailable channel as -1, and 0 indicates that the occupancy is still undetermined because of the delay caused by the transmission distance. Finally a 16 x 9 matrix is used as the input of the neural network. The state S at a given moment can be expressed as:
the occupancy of other users follows markov properties in the time line, namely:
step 105: the state matrix in step 104 is sent to the Actor part of step 101 to connect to the network. The method comprises the steps of respectively outputting actions Ac and Ap in the action space of a channel and power, and adopting an epsilon-greedy strategy in order to obtain more sample resources and search the environment in the action selection process, namely:
because channel selection and power allocation can be regarded as a classification problem, the method selects a softmax activation function, and a discrete model is constructed by the method, and random logistic regression is performed after the output action of the Actor neural network is selected.
Step 106: The state S from step 104 and the actions Ac, Ap from step 105 are input into the environment. The environment includes the jamming part of the jammer; fig. 3 shows a round-robin random interference pattern, i.e. the 16 different channels are attacked randomly over 4 consecutive time slots. Under interference the agent obtains a reward R; in order to maximize the throughput of the network, R is reduced by 1 if the same channel is selected in the same time slot, and R is also reduced by 1 if an agent selects the same channel in two consecutive time slots. This flexible action decision ensures effective channel access and is more conducive to disturbing the interference decision of the intelligent jammer so that it learns in a wrong direction.
Step 107: The state S from step 104, together with the actions Ac, Ap from step 105 and the reward R from step 106, is sent to the Critic network for learning, which outputs the temporal-difference error (TD-Error) calculated as in equation (17).
Step 108: The state, action, reward and temporal-difference error involved in step 107 are sent again to the Actor network for training and learning. The gradient that the Actor network needs to calculate at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s); the mean square error loss function is then used to improve the decision:
where the decision quality can be reflected by the per-step reward r_t and the round reward R_tot:
step 109: steps 104 to 106 are looped and the neural network parameters are updated after every 20 steps to maintain the stability of the Actor and Critic networks.
We compare the reward R obtained by the method with the random case, as shown in fig. 4. One hundred rounds are set, with 100 time slots per round, so the maximum reward per round is 300. The solid line is the cooperative intelligent access method based on the deep reinforcement learning Actor-Critic algorithm, and the dotted line is the random-decision access method. Both sides use probability transition channels with Markov properties in time and are subject to round-robin random interference. As can be seen from fig. 4, the access energy efficiency brought by the method is approximately 4 times that of random access, and convergence stability can be maintained. Under otherwise identical conditions an intelligent interference mode is set, which can also effectively learn the transition pattern of the channels; the effect of both parties playing the game in an intelligent manner is shown in fig. 5, where the solid line is the performance of intelligent access without considering power, the '-' curve is the performance when both parties consider power limitations, and the dashed line is the performance of our side when the adversary interference adopts a uniform-power interference mode; the method remains interference-resistant when limited power resources are considered. In addition, the method adopts a cooperative multi-agent approach, which improves the flexibility of access, so the intelligent jammer cannot select the same channel to interfere with in consecutive time slots, and the learning difficulty of the enemy's intelligent jamming is greatly increased. Moreover, in order to obtain greater access efficiency, the two intelligent agents do not select the same channel for access in the same time slot, so that the samples observed by the enemy jammer are poisoned and it learns in a wrong direction. A target network is also used to soft-update the computing network, which improves the stability and interference resistance of the decisions.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (7)

1. A multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, characterized by comprising the following steps:
step one: constructing a complex electromagnetic environment of multiple intelligent agents;
step two: in the electromagnetic environment in the first step, a Cartesian space three-dimensional coordinate system and an Actor-Critic neural network are built;
step three: according to the position coordinates of the three-dimensional space coordinate system in the second step, the acquired frequency spectrum information is dataized to obtain the input state of the neural network;
step four: after the state input in the third step is obtained, respectively inputting an Actor network and a Critic network, and enabling an intelligent body to perform two actions of selecting proper channels and power to resist an interference machine;
step five: the state in the third step is interacted with the action in the fourth step in the input environment to obtain the rewards of environment feedback;
step six: in the middle state of the third step, the actions in the fourth step and the rewards in the fifth step are input to the Actor and the Critic network together for experience learning, network parameters are optimized and updated, and the anti-interference effect of the satellite is achieved.
2. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step one is implemented as follows:
an integrated space-ground network is built in which N intelligent users transmit information to satellites, M satellites receive the information, and 1 traditional jammer and 1 intelligent jammer are present; the two jammers have the same opportunity to access a channel with limited power, and if a jammer and a user select the same channel in the same time slot, the user's transmission fails and the jamming succeeds; otherwise, the user's transmission succeeds and the interference is successfully avoided; furthermore, the interference path of the jammer is only partially observable.
3. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step two is implemented as follows:
large-scale fading is modeled based on a tractable line-of-sight (LoS) probabilistic model with shadowing and blocking effects; in the LoS probability model, large-scale fading follows a generalized Bernoulli distribution over two different events: the channel is LoS or non-LoS with a certain probability; because this is a satellite access model, only LoS channels are considered, so the large-scale fading between satellite m and user n is expressed as:
β_{m,n} = β_0 · ||l^(m) − l^(n)||^(−α)
the large-scale fading between satellite m and jammer j is expressed as:
β_{m,j} = β_0 · ||l^(m) − l^(j)||^(−α)
where β_0 is the average power gain at the reference distance d_0 = 1, l is a position vector in the three-dimensional space coordinate system, l^(m), l^(n) and l^(j) represent the position vectors of the satellite, the user and the jammer respectively, and α is the path-loss exponent;
the channel gain between satellite m and user n is expressed as:
g_{m,n}(t) = h_{m,n}(t) · β_{m,n}
the channel gain between satellite m and jammer j is expressed as:
g_{m,j}(t) = h_{m,j}(t) · β_{m,j}
where h_{m,n}(t) and h_{m,j}(t) are the small-scale fading effects at time t, following the Rician distribution;
the transmission powers of the jammer and of user n on the kth channel of the mth satellite are p_j^{m,k}(t) and p_n(t) respectively; thus, the channel capacity of user n on the kth channel of the mth satellite is:
C_{n,m,k}(t) = W · log2( 1 + p_n(t) · g_{m,n}(t) / ( σ² + p_j^{m,k}(t) · g_{m,j}(t) ) )
where W is the bandwidth of the channel and σ² is the Gaussian noise power; the transmission powers of the user and the jammer satisfy Σ_n p_n(t) ≤ P_tot, p_n(t) ≤ p_n^max and p_j(t) ≤ p_j^max, where P_tot is the maximum sum of power that the users are allowed to use during the time slot, and p_n^max and p_j^max are the maximum powers that can be used by the user and by the adversary jammer, respectively.
4. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step three is implemented as follows:
to ensure the stability of user access, the spectrum occupancy of consecutive time slots needs to be obtained; available unoccupied channels are marked as 1 and unavailable occupied channels are marked as -1; the spectrum occupancy of 9 consecutive time slots is used as the input of the neural network, and b_m(t) represents the observed channel conditions: b_{m,k}(t) = 1 indicates that a user has successfully accessed the satellite, b_{m,k}(t) = -1 indicates that a user attempted to access the satellite but failed, and b_{m,k}(t) = 0 indicates that no user has accessed the satellite;
here u indicates that user n has accessed the kth channel of satellite m at time t, i.e. u represents the occupancy of the user, and F_n(t) indicates whether the user failed to access the satellite, taking the two values 0 and 1;
define b_m(t) = [b_{m,1}(t), b_{m,2}(t), ..., b_{m,K}(t)] and B(t) = [b_1(t), b_2(t), ..., b_M(t)]^T to represent the occupancy of the channels.
5. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step four is implemented as follows:
the optimization target gradient of the Actor neural network is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · Q_ω(s_t, a_t) ]
where ω is the value parameter, θ is the network parameter, Q_ω(s_t, a_t) is the Q value of the current action, π_θ(a_t | s_t) is the policy at the current moment, and E_π is the expectation under the policy;
the mean square error loss function is adopted:
L(ω) = ( r + γ · V_ω(s_{t+1}) − V_ω(s_t) )²
where r is the reward of the action, γ is the decay factor, V_ω(s_t) is the state value function at the current time, and V(s_{t+1}) is the state value function at the next moment;
the Actor network selects a suitable channel from the user's action space and at the same time selects the transmission power to use from the power action space; this process outputs the action A_t; the Critic network calculates the next-moment state value V(s_{t+1}) and outputs the temporal-difference error to judge the quality of the action, where the temporal-difference error is calculated as:
TD-Error = Q(s, a) − V_ω(s_t) = r + γ · V(s_{t+1}) − V_ω(s_t)    (10)
where Q(s, a) is the Q value at the current moment.
6. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step five is implemented as follows:
the following situations can occur when the actions of the agents and of the jammers interact with the environment, where the maximum reward per step is 3 and the minimum reward is -1; below, C_{n,m,k}(t) denotes the transmission rate of user n on the kth channel of satellite m at the current time t, and C_threshold denotes the rate threshold for successful transmission;
(1) selecting the correct channel and the correct power:
in this case the intelligent user selects a channel not occupied by other users and correctly avoids the interference of the enemy jammer; the user then obtains the reward r = 3, and if multiple intelligent users select the same channel the reward falls back to r = 2; if interference is received but the power is sufficient to reach the transmission rate, i.e. C_{n,m,k}(t) > C_threshold, then r = 2, and if multiple intelligent users select the same channel the reward falls back to r = 1;
(2) selecting the correct channel but the wrong power:
in this case the intelligent user selects a channel not occupied by other users, but is interfered by the adversary and the power is not sufficient to reach the transmission rate, C_{n,m,k}(t) < C_threshold; the reward is then r = 1, and if multiple intelligent users select the same channel the reward is r = 0;
(3) selecting the wrong channel:
in this case the intelligent user selects the same channel as other users; when it is not interfered by the enemy jammer the reward is r = 0, and if it is interfered by the jammer, r = -1.
7. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step six is implemented as follows:
first the Critic network is trained: the next-moment state s_{t+1} is obtained through environmental observation, the V value of the next-moment state is obtained through network calculation, and the Bellman equation is computed:
TD-Error = Q(s, a) − V(s) = r + γ · V(s′) − V(s)    (11)
after the temporal-difference error TD-Error is calculated, it is sent to the Actor network together with the state S and the action A for learning, where TD-Error is the weight used when the Actor is updated; the Critic network does not need to estimate Q but only V; the TD-Error, i.e. the advantage function, can then be calculated, minimized, and its expectation taken; the policy gradient that needs to be calculated at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
the state value function V can naturally be used as the baseline, π_θ is the current policy and γ is the decay factor, and an update weight is obtained, where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s); Q_{π,γ}(s, a) is the Q value under the decay factor and V_{π,γ}(s) is the state value function under the decay factor;
the reward value can be used as the decision direction for training the agent through the loss function: the larger the reward value, the smaller the difference from the expected Q value and the better the training effect; the smaller the reward value, the larger the loss value, indicating a poor action selection; if the reward value over consecutive time slots is very small or negative, other actions are explored to update the policy gradient and find a better policy that improves the anti-interference capability of the agent, until the best policy is found, the reward fed back to the agent by the environment approaches the maximum reward value, and the policy converges stably, i.e. satellite access and anti-interference are achieved.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination