CN117715054A - Multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning - Google Patents


Info

Publication number
CN117715054A
CN117715054A (application CN202311658293.8A)
Authority
CN
China
Prior art keywords
user
satellite
channel
interference
access
Prior art date
Legal status: Pending
Application number
CN202311658293.8A
Other languages
Chinese (zh)
Inventor
王洪圆
潘健雄
欧阳巧琳
王培森
齐斌
许鲁彦
叶能
Current Assignee
32802 Troops Of People's Liberation Army Of China
Beijing Institute of Technology BIT
Original Assignee
32802 Troops Of People's Liberation Army Of China
Beijing Institute of Technology BIT
Priority date: 2023-12-05
Filing date: 2023-12-05
Publication date: 2024-03-15
Application filed by 32802 Troops Of People's Liberation Army Of China and Beijing Institute of Technology BIT
Priority to CN202311658293.8A
Publication of CN117715054A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

A multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, belonging to the field of satellite communication. A partially connected neural network is constructed using the Actor-Critic off-policy learning method of deep reinforcement learning, and a target network is used to soft-update the neural network parameters, which improves decision performance during the countermeasure process and adapts better to changes in the electromagnetic environment. In the environment modeling and in the state modeling of reinforcement learning, the action of the previous moment is fused into the state, and, combined with the judgment of the reward, different actions are output in consecutive time slots, so that intelligent access becomes more flexible and variable and the anti-interference capability of access is improved. By using a GPU computing network and an off-policy reinforcement learning method, sample collection, training and effective intelligent access can be carried out even when training samples and prior data are lacking. The invention is applicable to the field of satellite communication and improves the anti-interference capability while ensuring user access accuracy.

Description

Multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning
Technical Field
The invention relates to a multi-body collaborative satellite access and anti-interference method based on deep reinforcement learning, belonging to the field of satellite communication.
Background
Satellite network mass access refers to providing high-speed internet access services to users over a wide area through satellite technology. The concept arose from the urgent need for worldwide internet coverage, particularly in areas where geographic conditions are severe and infrastructure is relatively weak. Developing satellite communication systems is an important measure for seizing the high ground of space information network development and realizing the goal of becoming a network power; it can promote the industrialization of industry services such as navigation augmentation, wide-area monitoring, and data collection and distribution, and is an important means of driving the overall development of commercial aerospace and leading the upgrading of the information industry and aerospace technology.
However, a first difficulty with large-scale access to satellite networks is that current random access protocols perform poorly in ultra-dense networks, so efficient access schemes that can handle large numbers of requests are required. Another challenge is interference attacks. To address access congestion, enhanced random access schemes include priority-based, packet-based and code-spread random access, and some studies also consider coded random access and sparse code multiple access. However, these schemes require a centralized scheduling mechanism, which is not available in wide-area satellite access scenarios because of the large propagation delays and the large number of users. To resist interference attacks, common technical means include direct-sequence spread spectrum and frequency-hopping spread spectrum, as well as multi-beam antennas and adaptive anti-interference routing.
However, most existing work focuses on random access mechanisms that increase the success rate, and only some of it also achieves interference immunity. Satellites are vulnerable to interference attacks because of their high openness. In a malicious interference environment, a jammer reduces the channel quality by transmitting an interference signal, causing access failures. In addition, when a device cannot access the network it may continually attempt retransmission, leading to rapid battery discharge and increased channel blockage. Thus, an advanced random access scheme is needed to support large-scale operation of satellite networks under interference attacks. Conventional anti-interference methods cannot cope with intelligent interference that adjusts its interference strategy according to user behavior.
Therefore, for the problem of multi-body cooperative intelligent satellite access and anti-interference in a complex electromagnetic environment, where traditional access and anti-interference methods serve only as a reference and intelligent jamming with a continuously changing strategy must be faced, combining a deep reinforcement learning algorithm that continuously observes changes in the electromagnetic environment and learns the transformation rules of the interference can further improve multi-satellite access efficiency.
Disclosure of Invention
Aiming at the problems of insufficient anti-interference capability and poor environmental adaptability of existing satellite access technology, the main purpose of the invention is to provide a multi-body collaborative satellite access and anti-interference method based on deep reinforcement learning. The method adopts the Actor-Critic algorithm of deep reinforcement learning to combine a deep neural network with traditional Q-learning, sets the two modes of traditional artificial interference and intelligent interference in a satellite-ground cooperative access environment, and takes into account both transmission delay and the allocation of transmission power. Under the condition of ensuring user access accuracy, the anti-interference capability is improved.
The invention aims at realizing the following technical scheme:
The invention discloses a multi-agent cooperative electromagnetic anti-interference method based on deep reinforcement learning. First, a complex electromagnetic environment for multi-agent reinforcement learning is built, which includes transmission delay, signal fading and noise interference; the transmission channel is a time-varying channel with Markov properties. A partially connected neural network is adopted so that the two actions of channel selection and power allocation can be output simultaneously; a denser reward scheme is adopted to judge whether an action is good; an agent cannot select the same channel in consecutive time slots, which increases the variability of its decisions; and through multiple rounds of iteration, the access capability and the anti-interference capability are continuously improved.
The invention discloses a multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, which comprises the following steps:
step one: constructing a complex electromagnetic environment of multiple intelligent agents;
an aerospace integrated network is built, and N intelligent users transmit information to satellites in the network, wherein M satellites are used for receiving the information, and 1 traditional jammer and 1 intelligent jammer are arranged in the network. The two jammers have the same opportunity to adopt a limited power access channel, and if a plurality of jammers and a user select the same channel in the same time slot, the user fails to transmit, and the jammers successfully interfere; otherwise, the user transmission is successful and the interference is avoided successfully. Furthermore, the interference path of the jammer is partially observable.
Step two: in the electromagnetic environment in the first step, a Cartesian space three-dimensional coordinate system and an Actor-Critic neural network are built;
the large-scale fading is modeled based on a processable line of sight, i.e., loS, probabilistic model with shadow and occlusion effects. In the LoS probability model, large-scale fading follows the generalized bernoulli distribution of two different events; the channel is LoS or non-LoS (NLoS) with a certain probability. Because of the satellite access model, only LoS channels are considered, so the large scale fading between satellite m and user n is expressed as:
the large scale fading between satellite m and jammer j is expressed as:
wherein beta is 0 Is the reference distance d 0 The average power gain at=1, l is the vector of the three-dimensional space coordinate system,the position vectors of the satellite, the user and the jammer are respectively represented, and alpha is a path loss index.
The channel gain between satellite m and user n is expressed as:
the channel gain between satellite m and jammer j is expressed as:
wherein the method comprises the steps ofAnd->Is the effect of small scale fading at time t, following the rice (Rician) distribution.
The transmission powers of the jammer and of user n on the kth channel of the mth satellite are p_j^{m,k}(t) and p_n(t) respectively. Thus, the channel capacity of user n on the kth channel of the mth satellite is:
C_{n,m,k}(t) = W · log2( 1 + p_n(t) · g_{m,n}(t) / ( σ² + p_j^{m,k}(t) · g_{m,j}(t) ) )
where W is the bandwidth of the channel and σ² is the Gaussian noise power. The transmission powers of the user and the jammer satisfy Σ_n p_n(t) ≤ P_tot, p_n(t) ≤ p_n^max and p_j(t) ≤ p_j^max, where P_tot is the maximum sum of power that the users are allowed to use in that time slot, and p_n^max and p_j^max are the maximum powers that can be used by the user and by the adversary jammer, respectively.
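As a numerical illustration of the channel model above, the following sketch (assuming NumPy, a unit-reference-distance path-loss form and a standard Rician small-scale model; the concrete parameter values are assumptions) computes the channel gain and capacity for one user:

```python
# Sketch of the LoS large-scale fading, Rician small-scale fading and channel capacity.
import numpy as np

rng = np.random.default_rng(0)

def large_scale_fading(pos_a, pos_b, beta_0=1.0, alpha=2.0):
    """beta_0: average power gain at reference distance d_0 = 1; alpha: path-loss exponent."""
    d = np.linalg.norm(np.asarray(pos_a, float) - np.asarray(pos_b, float))
    return beta_0 * d ** (-alpha)

def rician_small_scale(k_factor=10.0):
    """One Rician fading sample: dominant LoS component plus scattered Gaussian components."""
    los = np.sqrt(k_factor / (k_factor + 1))
    scatter = np.sqrt(1 / (2 * (k_factor + 1))) * (rng.standard_normal() + 1j * rng.standard_normal())
    return los + scatter

def channel_capacity(p_user, g_user, p_jam, g_jam, W=1e6, noise_power=1e-13):
    """Capacity of user n on channel k of satellite m, treating the jammer signal as noise."""
    sinr = p_user * g_user / (noise_power + p_jam * g_jam)
    return W * np.log2(1 + sinr)

beta = large_scale_fading([0.0, 0.0, 550e3], [100.0, 200.0, 0.0])   # satellite ~550 km overhead
g_user = np.abs(rician_small_scale()) ** 2 * beta
g_jam = 0.1 * g_user                                                # weaker jammer-to-satellite link
print(channel_capacity(p_user=2.0, g_user=g_user, p_jam=1.0, g_jam=g_jam))
```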
Step three: according to the position coordinates of the three-dimensional coordinate system in step two, the acquired spectrum information is converted into data to obtain the input state of the neural network;
to ensure the stability of user access, the spectrum occupation condition of continuous time slots needs to be obtained. The available unoccupied channels are denoted as 1 and the unavailable occupied channels are denoted as-1. And using b as input to the neural network the spectrum occupancy of consecutive 9 and time slots m (t) represents the observed channel conditions, b m,k (t) =1 indicates that there is a successful access to the satellite by the user and that b m,k (t) = -1 indicates that there is a user accessing the satellite but fails, b m,k (t) =0 means that no user has access to the satellite.
Wherein,indicating that user n has access to satellite m at time tThe kth channel, u, represents the occupancy of the user. F (F) n (t) two cases indicating that the user failed to access the satellite may be 0 and 1.
Definition b m (t)=[b m,1 (t),b m,2 (t)····b m,k (t)],B(t)=[b 1 (t),b 2 (t)····b M (t)] T To represent the occupancy of the channel.
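A minimal sketch of this state construction (assuming the 16-channel, 9-slot observation window used in the embodiment; function and variable names are assumptions) could be:

```python
# Sketch of the spectrum-occupancy state: entries are 1 (successful access),
# -1 (access attempted but failed) and 0 (no access observed yet).
import numpy as np

NUM_CHANNELS, NUM_SLOTS = 16, 9

def update_state(state, newest_slot):
    """state: (NUM_CHANNELS, NUM_SLOTS) history of the last 9 slots.
    newest_slot: length-NUM_CHANNELS vector with values in {1, -1, 0} for the current slot.
    The oldest slot is dropped and the newest appended."""
    newest = np.asarray(newest_slot).reshape(-1, 1)
    return np.concatenate([state[:, 1:], newest], axis=1)

state = np.zeros((NUM_CHANNELS, NUM_SLOTS))
current = np.zeros(NUM_CHANNELS)
current[3], current[7] = 1, -1          # channel 3 accessed successfully, channel 7 failed
state = update_state(state, current)
print(state.shape)                      # (16, 9), flattened before feeding the Actor/Critic
```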
Step four: after the state input from step three is obtained, it is fed into the Actor network and the Critic network respectively, and the agent performs the two actions of selecting a suitable channel and power to resist the jammer;
the optimized target gradients of the two neural networks at this time are as follows:
where ω is a value parameter, θ is a network parameter, qω (s t ,a t ) The current action Q value. Pi θ (a t |s t ) For the current moment policy E π Is a policy expectation.
The mean square error loss function is adopted:
where r is the reward for the action, gamma is the decay factor, V w (s t ) Representing the state cost function at the current time, V (s t+1 ) Representing the state cost function at the next moment.
The Actor network selects a proper channel from the action space of the user, and simultaneously selects the used transmission power in the power action space, and the process is output action At. Critic network calculates a next time state V value V (s t+1 ) Outputting the time division error to judge the quality of the action, wherein the time division error is calculated in the following way:
TD-Error=Q(s,a)-V w (s t )
=r+gamma*V(s t+1 )-V w (s t ) (10)
wherein Q (s, a) is the Q value at the current moment.
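A compact sketch of the TD error and the two losses defined above (TensorFlow 2.x is assumed, matching the framework used in the embodiment; function names are assumptions):

```python
# Sketch of equation (10) and of the Critic/Actor training targets.
import tensorflow as tf

GAMMA = 0.95   # decay factor

def td_error(reward, v_s, v_s_next):
    """TD-Error = r + gamma * V(s_{t+1}) - V(s_t); v_s and v_s_next are Critic outputs."""
    return reward + GAMMA * v_s_next - v_s

def critic_loss(reward, v_s, v_s_next):
    """Mean square error loss on the TD target, used to train the Critic."""
    return tf.reduce_mean(tf.square(td_error(reward, v_s, tf.stop_gradient(v_s_next))))

def actor_loss(log_prob_action, td_err):
    """Policy-gradient loss: -log pi_theta(a|s) weighted by the (fixed) TD error."""
    return -tf.reduce_mean(log_prob_action * tf.stop_gradient(td_err))
```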
Step five: the state from step three and the action from step four are input into the environment to interact with it and obtain the reward fed back by the environment;
the following conditions can occur in the interaction process of the actions of the intelligent body and the actions of the jammer with the environment, wherein the maximum rewards of each step are 3, and the minimum rewards are-1. The following C n,m,k (t) represents the transmission rate of the kth channel of the satellite m selected by the user n at the current moment in time t, C threshold Indicating the rate threshold for successful transmission.
(1) Selecting the correct channel correct power
In the case, the intelligent user selects a channel which is not occupied by other users, and the interference of an enemy jammer is avoided correctly, at the moment, the intelligent user obtains a reward r=3, and if a plurality of intelligent users select the same channel, the reward is returned to r=2; if interference is received but the power is sufficient to reach the transmission rate, C n,m,k (t)>C threshold At this point r=2, if multiple smart users select the same channel, the reward is back to r=1.
(2) Selecting the correct channel error power
In this case, the intelligent user selects a channel which is not occupied by other users, but is interfered by the adversary, and the power is not enough to reach the transmission rate, C n,m,k (t)<C threshold At this time, the prize is r=1, and if multiple smart users select the same channel, the prize is r=0
(3) Selecting the wrong channel
In this case, the intelligent user selects a channel identical to the other users, and when the intelligent user is not interfered by the hostile jammer, the reward is r=0, and if the intelligent user is interfered by the jammer, r= -1.
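The three cases above can be condensed into a small reward function; the sketch below assumes the distinction made in the text between background "other users" and the cooperating intelligent users, and all names are illustrative:

```python
# Sketch of the per-step reward in step five: maximum 3, minimum -1.
def step_reward(channel_taken_by_others, jammed, rate, rate_threshold, shared_with_agents):
    """channel_taken_by_others: the chosen channel is already occupied by other users (case 3).
    jammed: an enemy jammer attacked the chosen channel.
    rate / rate_threshold: achieved transmission rate vs. C_threshold.
    shared_with_agents: several intelligent users picked this channel."""
    if channel_taken_by_others:                  # case (3): wrong channel
        return -1 if jammed else 0
    if not jammed:                               # case (1): correct channel, interference avoided
        return 2 if shared_with_agents else 3
    if rate > rate_threshold:                    # case (1): jammed, but the power still suffices
        return 1 if shared_with_agents else 2
    return 0 if shared_with_agents else 1        # case (2): jammed and power insufficient
```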
Step six: the state from step three, the action from step four and the reward from step five are input together into the Actor and Critic networks for experience learning, the network parameters are optimized and updated, and the satellite access and anti-interference effect is achieved.
First the Critic network is trained: the next-moment state s_{t+1} is obtained through environmental observation, the V value of the next-moment state is obtained through network calculation, and the Bellman equation is computed:
TD-Error = Q(s, a) − V(s) = r + γ · V(s′) − V(s)    (11)
After the temporal-difference error TD-Error is calculated, it is sent to the Actor network together with the state S and the action A for learning; TD-Error is the weight used when the Actor is updated. The Critic network does not need to estimate Q but only V. The TD-Error, i.e. the advantage function, can then be calculated, minimized, and its expectation taken. The policy gradient that needs to be calculated at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
The state value function V can naturally be used as the baseline, π_θ is the current policy and γ is the decay factor, and an update weight is obtained, where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s).
Q_{π,γ}(s, a) is the Q value under the decay factor and V_{π,γ}(s) is the state value function under the decay factor.
The reward value can be used as the decision direction for training the agent through the loss function: the larger the reward value, the smaller the difference from the expected Q value and the better the training effect; the smaller the reward value, the larger the loss value, indicating a poor action selection. If the reward value over consecutive time slots is very small or negative, other actions are explored to update the policy gradient and find a better policy that improves the anti-interference capability of the agent, until the best policy is found, the reward fed back to the agent by the environment approaches the maximum reward value, and the policy converges stably, i.e. satellite access and anti-interference are achieved.
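A sketch of one joint Critic/Actor update following step six (TensorFlow 2.x assumed; the Actor is assumed here to output unnormalized logits over a single discrete action space, and all function names are illustrative):

```python
# Sketch of a single training step: the Critic learns V(s) from the TD target and the
# TD error weights the Actor's policy-gradient update, as described in step six.
import tensorflow as tf

def train_step(actor, critic, actor_opt, critic_opt, state, action, reward, next_state, gamma=0.95):
    with tf.GradientTape(persistent=True) as tape:
        v_s = tf.squeeze(critic(state), axis=-1)                  # V(s_t)
        v_next = tf.squeeze(critic(next_state), axis=-1)          # V(s_{t+1})
        td_err = reward + gamma * tf.stop_gradient(v_next) - v_s  # equation (11)
        critic_loss = tf.reduce_mean(tf.square(td_err))

        logits = actor(state)                                     # policy logits over actions
        log_prob = tf.nn.log_softmax(logits)
        chosen = tf.reduce_sum(log_prob * tf.one_hot(action, logits.shape[-1]), axis=-1)
        actor_loss = -tf.reduce_mean(chosen * tf.stop_gradient(td_err))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    del tape
    return float(critic_loss), float(actor_loss)
```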
The beneficial effects are that:
1. The invention discloses a multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning that uses the Actor-Critic off-policy learning algorithm of deep reinforcement learning. A partially connected neural network is constructed and each network branch is trained with its own parameters, which gives the method an advantage in resource allocation over other intelligent access technologies. In addition, a target network is used for soft updating of the neural network parameters, which improves decision performance in the countermeasure process and adapts better to changes in the electromagnetic environment.
2. In the multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning, in the environment modeling and in the state modeling of reinforcement learning, not only is the continuous spectrum information taken as the state input, but the action of the previous moment is also fused into the state; combined with the judgment of the reward, different actions can then be output in consecutive time slots, so that intelligent access has flexibility and variability and its strategy is difficult for intelligent interference to learn. While ensuring access to its own channel resources, the anti-interference capability of access is improved.
3. The multi-body collaborative electromagnetic anti-interference method based on deep reinforcement learning disclosed by the invention uses a GPU computing network and an off-policy reinforcement learning algorithm, can perform sample-collection training and effective intelligent access even when training samples and prior data are lacking, and simulates more complex and realistic electromagnetic countermeasure environments.
Drawings
FIG. 1 is a schematic flow chart of the multi-body cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed by the invention;
FIG. 2 is the partially connected neural network constructed in this embodiment;
FIG. 3 is a schematic diagram of random access in a satellite network under a malicious interference attack in this embodiment;
FIG. 4 is an access performance diagram against the adversary's conventional jammer in this embodiment;
FIG. 5 is an access performance diagram against the adversary's intelligent jammer in this embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples, together with the technical problems solved and the beneficial effects of the technical solution; the described embodiment is only intended to facilitate understanding of the invention and has no limiting effect.
The multi-agent cooperative electromagnetic anti-interference method based on deep reinforcement learning disclosed in this embodiment is implemented on the TensorFlow 2.0 and TensorLayer 2.0 frameworks. It adopts the deep reinforcement learning Actor-Critic algorithm, replaces the traditional Q-table calculation with a partially connected neural network, bridges the processed spectrum information with the electromagnetic environment observation based on the optimization model, sends the bridged state into the networks to obtain the output action and the action's Q value respectively, and finally performs experience training on the state, action, reward and Q value of the whole process to update the parameters. The specific parameter settings are as follows:
attenuation factor gamma 0.95
Actor learning rate 0.0001
Critic learning rate 0.001
Number of channels 16
Training batch 300
Every round of time slots 109
Greedy factor epsilon 0.15
Optimizer Adam
As shown in fig. 1, the method comprises the following steps:
step 101: and establishing an Actor part of an Actor-Critic algorithm to be connected with the neural network.
A 3-layer partially connected Actor neural network is built. The first layer is a public (shared) layer containing 256 neurons. The second layer is a normalization layer (Batch Normalization), which improves the convergence speed of the neural network and the generalization capability of the network to prevent overfitting. The normalization takes the form:
μ_B = (1/m) Σ_i x_i,   σ_B² = (1/m) Σ_i (x_i − μ_B)²,   x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε),
y_i ← γ · x̂_i + β = BN_{γ,β}(x_i)    (16)
where m is the number of samples, x_i is a sample value, μ_B is the sample mean, σ_B² is the sample variance, γ and β are the introduced scale and shift factors respectively, ε is a manually set constant that prevents the denominator from being 0, BN_{γ,β}(x_i) is the sample data after passing through the BN layer, and y_i is the corresponding output.
The procedure is: first obtain the m sample values x_i, calculate the sample mean μ_B, then calculate the sample variance σ_B² and standardize the sample data; the two parameters γ (scale factor) and β (shift factor) are introduced and trained, so that the network can learn to recover the feature distribution that the original network needs to learn, and the normalized sample value x̂_i is computed. Normalization accelerates the convergence of the network, and using the batch mean and variance can be regarded as introducing noise, i.e. it helps prevent overfitting.
The third layer consists of 2 partially connected branches, each containing 128 neurons. The network structure is shown in fig. 2. The first branch outputs the selected action in the channel-selection action space, and the second branch outputs the power-allocation action in the power action space. Since the purposes of the two actions differ and the directions of the calculated gradients differ, two different sets of network parameters need to be trained separately.
Furthermore, both branches use the Adam optimizer with the same learning rate of 0.001.
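A sketch of this partially connected Actor in tf.keras (the layer sizes follow the text; the number of discrete power levels, the activations and all identifiers are assumptions):

```python
# Sketch of the 3-layer partially connected Actor: shared 256-neuron layer, a Batch
# Normalization layer, then two 128-neuron branches for channel selection and power allocation.
import tensorflow as tf

def build_actor(state_dim=16 * 9, num_channels=16, num_power_levels=4):
    inputs = tf.keras.Input(shape=(state_dim,))
    shared = tf.keras.layers.Dense(256, activation="relu")(inputs)       # public layer
    shared = tf.keras.layers.BatchNormalization()(shared)                # normalization layer
    ch = tf.keras.layers.Dense(128, activation="relu")(shared)           # branch 1: channel
    ch_out = tf.keras.layers.Dense(num_channels, activation="softmax", name="channel")(ch)
    pw = tf.keras.layers.Dense(128, activation="relu")(shared)           # branch 2: power
    pw_out = tf.keras.layers.Dense(num_power_levels, activation="softmax", name="power")(pw)
    return tf.keras.Model(inputs, [ch_out, pw_out])

actor = build_actor()
actor_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)                 # Adam, as stated above
```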
Step 102: Construct the fully connected Critic network.
The Critic part of the algorithm is similar to the Actor part: 256 neurons in the first layer, followed by the BN layer and then the output layer. However, the final output of the network has only one neuron, whose output is used to form the temporal-difference error (TD-Error), which judges whether the action output by the Actor network is good or bad.
The calculation form of the TD error is as follows:
TD-Error = r + γ · V(s′) − V(s)    (17)
similarly, adam optimizer is selected for use with a learning rate of 0.01
Step 103: Configure the other parameters required for training the network: the learning rate, batch size, weight initialization method, weight decay coefficient, optimization method, number of iterations, round size and decay factor.
Step 104: After initializing the network parameters and states, at the beginning of each round, each time slot observes the spectrum occupancy in the environment: an available channel is marked as 1, an unavailable channel as -1, and 0 indicates that the occupancy is still undetermined because of the delay caused by the transmission distance. Finally a 16 x 9 matrix is used as the input of the neural network. The state S at a given moment can be expressed as:
the occupancy of other users follows markov properties in the time line, namely:
step 105: the state matrix in step 104 is sent to the Actor part of step 101 to connect to the network. The method comprises the steps of respectively outputting actions Ac and Ap in the action space of a channel and power, and adopting an epsilon-greedy strategy in order to obtain more sample resources and search the environment in the action selection process, namely:
because channel selection and power allocation can be regarded as a classification problem, the method selects a softmax activation function, and a discrete model is constructed by the method, and random logistic regression is performed after the output action of the Actor neural network is selected.
Step 106: The state S from step 104 and the actions Ac, Ap from step 105 are input into the environment. The environment includes the jamming part of the jammer; fig. 3 shows a round-robin random interference pattern, i.e. the 16 different channels are attacked randomly over 4 consecutive time slots. Under interference the agent obtains a reward R; in order to maximize the throughput of the network, R is reduced by 1 if the same channel is selected in the same time slot, and R is also reduced by 1 if an agent selects the same channel in two consecutive time slots. This flexible action decision ensures effective channel access and is more conducive to disturbing the interference decision of the intelligent jammer so that it learns in a wrong direction.
Step 107: The state S from step 104, together with the actions Ac, Ap from step 105 and the reward R from step 106, is sent to the Critic network for learning, which outputs the temporal-difference error (TD-Error) calculated as in equation (17).
Step 108: The state, action, reward and temporal-difference error involved in step 107 are sent again to the Actor network for training and learning. The gradient that the Actor network needs to calculate at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s); the mean square error loss function is then used to improve the decision:
where the decision quality can be reflected by the per-step reward r_t and the round reward R_tot:
step 109: steps 104 to 106 are looped and the neural network parameters are updated after every 20 steps to maintain the stability of the Actor and Critic networks.
We compare the reward R obtained by the method with the random case, as shown in fig. 4. One hundred rounds are set, with 100 time slots per round, so the maximum reward per round is 300. The solid line is the cooperative intelligent access method based on the deep reinforcement learning Actor-Critic algorithm, and the dotted line is the random-decision access method. Both sides use probability transition channels with Markov properties in time and are subject to round-robin random interference. As can be seen from fig. 4, the access energy efficiency brought by the method is approximately 4 times that of random access, and convergence stability can be maintained. Under otherwise identical conditions an intelligent interference mode is set, which can also effectively learn the transition pattern of the channels; the effect of both parties playing the game in an intelligent manner is shown in fig. 5, where the solid line is the performance of intelligent access without considering power, the '-' curve is the performance when both parties consider power limitations, and the dashed line is the performance of our side when the adversary interference adopts a uniform-power interference mode; the method remains interference-resistant when limited power resources are considered. In addition, the method adopts a cooperative multi-agent approach, which improves the flexibility of access, so the intelligent jammer cannot select the same channel to interfere with in consecutive time slots, and the learning difficulty of the enemy's intelligent jamming is greatly increased. Moreover, in order to obtain greater access efficiency, the two intelligent agents do not select the same channel for access in the same time slot, so that the samples observed by the enemy jammer are poisoned and it learns in a wrong direction. A target network is also used to soft-update the computing network, which improves the stability and interference resistance of the decisions.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (7)

1. A multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning, characterized by comprising the following steps:
step one: constructing a complex electromagnetic environment of multiple intelligent agents;
step two: in the electromagnetic environment in the first step, a Cartesian space three-dimensional coordinate system and an Actor-Critic neural network are built;
step three: according to the position coordinates of the three-dimensional space coordinate system in the second step, the acquired frequency spectrum information is dataized to obtain the input state of the neural network;
step four: after the state input in the third step is obtained, respectively inputting an Actor network and a Critic network, and enabling an intelligent body to perform two actions of selecting proper channels and power to resist an interference machine;
step five: the state in the third step is interacted with the action in the fourth step in the input environment to obtain the rewards of environment feedback;
step six: in the middle state of the third step, the actions in the fourth step and the rewards in the fifth step are input to the Actor and the Critic network together for experience learning, network parameters are optimized and updated, and the anti-interference effect of the satellite is achieved.
2. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step one is implemented as follows:
an integrated space-ground network is built in which N intelligent users transmit information to satellites, M satellites receive the information, and 1 traditional jammer and 1 intelligent jammer are present; the two jammers have the same opportunity to access a channel with limited power, and if a jammer and a user select the same channel in the same time slot, the user's transmission fails and the jamming succeeds; otherwise, the user's transmission succeeds and the interference is successfully avoided; furthermore, the interference path of the jammer is only partially observable.
3. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step two is implemented as follows:
large-scale fading is modeled based on a tractable line-of-sight (LoS) probabilistic model with shadowing and blocking effects; in the LoS probability model, large-scale fading follows a generalized Bernoulli distribution over two different events: the channel is LoS or non-LoS with a certain probability; because this is a satellite access model, only LoS channels are considered, so the large-scale fading between satellite m and user n is expressed as:
β_{m,n} = β_0 · ||l^(m) − l^(n)||^(−α)
the large-scale fading between satellite m and jammer j is expressed as:
β_{m,j} = β_0 · ||l^(m) − l^(j)||^(−α)
where β_0 is the average power gain at the reference distance d_0 = 1, l is a position vector in the three-dimensional space coordinate system, l^(m), l^(n) and l^(j) represent the position vectors of the satellite, the user and the jammer respectively, and α is the path-loss exponent;
the channel gain between satellite m and user n is expressed as:
g_{m,n}(t) = h_{m,n}(t) · β_{m,n}
the channel gain between satellite m and jammer j is expressed as:
g_{m,j}(t) = h_{m,j}(t) · β_{m,j}
where h_{m,n}(t) and h_{m,j}(t) are the small-scale fading effects at time t, following the Rician distribution;
the transmission powers of the jammer and of user n on the kth channel of the mth satellite are p_j^{m,k}(t) and p_n(t) respectively; thus, the channel capacity of user n on the kth channel of the mth satellite is:
C_{n,m,k}(t) = W · log2( 1 + p_n(t) · g_{m,n}(t) / ( σ² + p_j^{m,k}(t) · g_{m,j}(t) ) )
where W is the bandwidth of the channel and σ² is the Gaussian noise power; the transmission powers of the user and the jammer satisfy Σ_n p_n(t) ≤ P_tot, p_n(t) ≤ p_n^max and p_j(t) ≤ p_j^max, where P_tot is the maximum sum of power that the users are allowed to use during the time slot, and p_n^max and p_j^max are the maximum powers that can be used by the user and by the adversary jammer, respectively.
4. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step three is implemented as follows:
to ensure the stability of user access, the spectrum occupancy of consecutive time slots needs to be obtained; available unoccupied channels are marked as 1 and unavailable occupied channels are marked as -1; the spectrum occupancy of 9 consecutive time slots is used as the input of the neural network, and b_m(t) represents the observed channel conditions: b_{m,k}(t) = 1 indicates that a user has successfully accessed the satellite, b_{m,k}(t) = -1 indicates that a user attempted to access the satellite but failed, and b_{m,k}(t) = 0 indicates that no user has accessed the satellite;
here u indicates that user n has accessed the kth channel of satellite m at time t, i.e. u represents the occupancy of the user, and F_n(t) indicates whether the user failed to access the satellite, taking the two values 0 and 1;
define b_m(t) = [b_{m,1}(t), b_{m,2}(t), ..., b_{m,K}(t)] and B(t) = [b_1(t), b_2(t), ..., b_M(t)]^T to represent the occupancy of the channels.
5. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step four is implemented as follows:
the optimization target gradient of the Actor neural network is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · Q_ω(s_t, a_t) ]
where ω is the value parameter, θ is the network parameter, Q_ω(s_t, a_t) is the Q value of the current action, π_θ(a_t | s_t) is the policy at the current moment, and E_π is the expectation under the policy;
the mean square error loss function is adopted:
L(ω) = ( r + γ · V_ω(s_{t+1}) − V_ω(s_t) )²
where r is the reward of the action, γ is the decay factor, V_ω(s_t) is the state value function at the current time, and V(s_{t+1}) is the state value function at the next moment;
the Actor network selects a suitable channel from the user's action space and at the same time selects the transmission power to use from the power action space; this process outputs the action A_t; the Critic network calculates the next-moment state value V(s_{t+1}) and outputs the temporal-difference error to judge the quality of the action, where the temporal-difference error is calculated as:
TD-Error = Q(s, a) − V_ω(s_t) = r + γ · V(s_{t+1}) − V_ω(s_t)    (10)
where Q(s, a) is the Q value at the current moment.
6. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step five is implemented as follows:
the following situations can occur when the actions of the agents and of the jammers interact with the environment, where the maximum reward per step is 3 and the minimum reward is -1; below, C_{n,m,k}(t) denotes the transmission rate of user n on the kth channel of satellite m at the current time t, and C_threshold denotes the rate threshold for successful transmission;
(1) selecting the correct channel and the correct power:
in this case the intelligent user selects a channel not occupied by other users and correctly avoids the interference of the enemy jammer; the user then obtains the reward r = 3, and if multiple intelligent users select the same channel the reward falls back to r = 2; if interference is received but the power is sufficient to reach the transmission rate, i.e. C_{n,m,k}(t) > C_threshold, then r = 2, and if multiple intelligent users select the same channel the reward falls back to r = 1;
(2) selecting the correct channel but the wrong power:
in this case the intelligent user selects a channel not occupied by other users, but is interfered by the adversary and the power is not sufficient to reach the transmission rate, C_{n,m,k}(t) < C_threshold; the reward is then r = 1, and if multiple intelligent users select the same channel the reward is r = 0;
(3) selecting the wrong channel:
in this case the intelligent user selects the same channel as other users; when it is not interfered by the enemy jammer the reward is r = 0, and if it is interfered by the jammer, r = -1.
7. The multi-body cooperative satellite access and anti-interference method based on deep reinforcement learning according to claim 1, wherein step six is implemented as follows:
first the Critic network is trained: the next-moment state s_{t+1} is obtained through environmental observation, the V value of the next-moment state is obtained through network calculation, and the Bellman equation is computed:
TD-Error = Q(s, a) − V(s) = r + γ · V(s′) − V(s)    (11)
after the temporal-difference error TD-Error is calculated, it is sent to the Actor network together with the state S and the action A for learning, where TD-Error is the weight used when the Actor is updated; the Critic network does not need to estimate Q but only V; the TD-Error, i.e. the advantage function, can then be calculated, minimized, and its expectation taken; the policy gradient that needs to be calculated at this point is:
∇_θ J(θ) = E_π[ ∇_θ log π_θ(a_t | s_t) · A_{π,γ}(s_t, a_t) ]
the state value function V can naturally be used as the baseline, π_θ is the current policy and γ is the decay factor, and an update weight is obtained, where A_{π,γ}(s_t, a_t) is the advantage function, A_{π,γ}(s, a) = Q_{π,γ}(s, a) − V_{π,γ}(s); Q_{π,γ}(s, a) is the Q value under the decay factor and V_{π,γ}(s) is the state value function under the decay factor;
the reward value can be used as the decision direction for training the agent through the loss function: the larger the reward value, the smaller the difference from the expected Q value and the better the training effect; the smaller the reward value, the larger the loss value, indicating a poor action selection; if the reward value over consecutive time slots is very small or negative, other actions are explored to update the policy gradient and find a better policy that improves the anti-interference capability of the agent, until the best policy is found, the reward fed back to the agent by the environment approaches the maximum reward value, and the policy converges stably, i.e. satellite access and anti-interference are achieved.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination