CN115806062A - Modeling method, model and acquisition method of satellite relative phase holding strategy model

Info

Publication number: CN115806062A
Application number: CN202211410211.3A
Authority: CN (China)
Prior art keywords: satellite, states, axis control, relative phase, moment
Legal status: Granted; Active
Other versions: CN115806062B (granted publication)
Other languages: Chinese (zh)
Inventors: 吴琳琳, 吴新林, 何镇武, 吴凌根, 陈倩茹, 王丽颖, 张琳娜
Current and original assignee: Emposat Co Ltd
Application filed by Emposat Co Ltd; priority to CN202211410211.3A

Classifications

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the field of aerospace and provides a modeling method, model, acquisition method, device and medium for a satellite relative phase keeping strategy model. The modeling method comprises the following steps. S1: acquire a plurality of satellite training state data sets. S2: obtain all semi-major-axis control behaviors after the initial moment and the corresponding output Q values. S3: acquire the states of the first satellite and the second satellite at the current moment, and obtain a semi-major-axis control behavior. S4: obtain the reward and the states of the first satellite and the second satellite at the next moment. S5: store the satellite combination state data set in an experience pool. S6: calculate the target value. S7: calculate the error. S8: update the Q value, and take the states of the first satellite and the second satellite at the next moment as the states at the current moment. S9: repeat S3-S8 and update the weight parameter of the target neural network. S10: repeat S2-S9 until all the data have been input. The scheme can obtain an optimal decision strategy and reduce satellite fuel consumption.

Description

Modeling method, model and acquisition method of satellite relative phase holding strategy model
Technical Field
The invention relates to the technical field of aerospace, and in particular to a modeling method, a model, an acquisition method, a device and a medium for a satellite relative phase keeping strategy model based on Nature DQN.
Background
With the continuous development of human aerospace activities, more and more remote sensing satellites provide help for the daily life of people.
A satellite constellation generally requires each of its satellites to maintain a given phase during operation. Owing to the various perturbations acting during orbit injection and on-orbit operation, a certain error exists between the actual phase of a satellite and its nominal phase. When that error grows large enough to degrade constellation performance, the satellite phase should be controlled so as to eliminate the error between the actual and nominal phases.
Fully autonomous orbit maintenance can effectively reduce the operating cost of a satellite and improve its ability to respond to emergencies. If autonomous orbit maintenance of MEO satellites can be achieved, the operational capability of the constellation can be greatly improved while maintenance costs are reduced. A satellite with fully autonomous orbit-maintenance capability must have fully autonomous navigation and orbit control. The lifetime of a satellite is determined primarily by the fuel it carries, so an effective phase control method extends that lifetime.
The conventional approach first analyses, with a dynamical model, the satellite phase changes caused by the various perturbation forces, such as the Earth's non-spherical shape and luni-solar gravitation, acting on the constellation satellites during orbital operation; it then concludes, from the relation between phase deviation and semi-major-axis deviation, that phase deviation can be eliminated indirectly by adjusting the semi-major axis; a relative phase maintenance strategy is then designed, the maintenance parameters are optimized, and propellant consumption is calculated. The prior art thus builds complex models of the various perturbation forces acting on the satellite in orbit. However, because of the complexity of the forces in space and the uncertainty of the satellite's own parameters, the satellite cannot be modeled accurately; the number of parameters is large and the computation complex, which degrades the precision of satellite phase keeping and may consume more fuel.
There is therefore an urgent need for a modeling method, model, acquisition method, device and medium for a satellite relative phase keeping strategy model based on Nature DQN that reduce the modeling difficulty and accurately compute a relative phase keeping strategy.
Disclosure of Invention
The invention aims to provide a modeling method, model, acquisition method, device and medium for a satellite relative phase maintenance strategy model. When maintaining the relative phase of an MEO three-axis-stabilized satellite, no complex modeling is required, and neither the complexity of the forces in space nor the uncertainty of the satellite parameters needs to be considered; the strong behavior decision-making capability of reinforcement learning yields an optimal decision strategy and reduces satellite fuel consumption.
In order to solve the above technical problem, as an aspect of the present invention, there is provided a method for modeling a satellite relative phase preservation strategy model based on Nature DQN, comprising the steps of:
S1: initializing a model and acquiring a plurality of satellite training state data sets, wherein each satellite training state data set comprises the states of a first satellite and a second satellite at an initial moment, a plurality of expected orbit control moments and an expected number of orbit controls; the states of the first satellite and the second satellite comprise the relative phase difference of the first satellite and the second satellite;
S2: inputting the states of the first satellite and the second satellite at the initial moment of one satellite training state data set into the model to obtain all semi-major-axis control behaviors after the initial moment and the corresponding output Q values;
S3: acquiring the states of the first satellite and the second satellite at the current moment, and acquiring the semi-major-axis control behavior executed by the first satellite or the second satellite according to a greedy strategy;
S4: executing the semi-major-axis control action to obtain the states of the first satellite and the second satellite at the next moment, and obtaining a reward according to those states and the relative phase keeping strategy reward function; the relative phase keeping strategy reward function is Equation 1:

[Equation 1 is reproduced only as an image in the source publication and is not recoverable here; per the accompanying description it penalizes control actions that drive the relative phase difference outside its holding band.]

where r_t is the reward for the semi-major-axis control action performed by the first satellite or the second satellite at the current moment; Δλ_0 is the relative phase difference of the first satellite and the second satellite on the nominal orbit; Δλ_s is the holding threshold for the relative phase difference of the first satellite and the second satellite; Δλ_{t+1} is the relative phase difference of the first satellite and the second satellite at the next moment; |Δλ_{t+1} − Δλ_0| is the change, relative to the nominal orbit, of the relative phase difference at the next moment extrapolated from the current moment after the semi-major-axis control action is performed, i.e. the influence of the control action performed at the current moment on the relative phase difference of the two satellites; t is the current moment; and t_0 is the expected orbit control moment closest to the current moment;
S5: storing the states of the first satellite and the second satellite at the current moment, the semi-major-axis control action executed by the first satellite or the second satellite, the reward, and the states of the first satellite and the second satellite at the next moment in an experience pool as one satellite combination state data set;
S6: taking a number of satellite combination state data sets out of the experience pool, and calculating the target value of each satellite combination state data set according to the target neural network weight parameter;
S7: calculating an error according to the loss function, and updating the weight parameter of the current neural network;
S8: updating the Q value according to the value function, and taking the states of the first satellite and the second satellite at the next moment as the states of the first satellite and the second satellite at the current moment;
s9: repeating steps S3-S8, wherein the times for executing steps S3-S8 is equal to the expected orbit control times of the set of satellite training state data; after the steps S3-S8 of the appointed iteration times are repeatedly executed, updating the weight parameter of the target neural network according to the weight parameter of the current neural network;
S10: repeating steps S2-S9 until all the data of the satellite training state data sets have been input.
According to an exemplary embodiment of the invention, initializing the model in step S1 comprises defining a loss function.
According to an exemplary embodiment of the present invention, the input of the model is the states of the first satellite and the second satellite, and the output is the return value (Q value) after the first satellite or the second satellite performs a semi-major-axis control action.
According to an exemplary embodiment of the present invention, in step S3, during the initial cycle, the states of the first satellite and the second satellite at the current time are the states of the first satellite and the second satellite at the initial time.
According to an example embodiment of the present invention, in step S3, the method for obtaining the semi-major axis control behavior executed by the first satellite or the second satellite according to the greedy policy includes: the first satellite or the second satellite randomly selects the semimajor axis control action according to the first designated probability or executes the semimajor axis control action corresponding to the maximum Q value according to the second designated probability; the sum of the first specified probability and the second specified probability equals 1.
According to an exemplary embodiment of the present invention, in step S6, the target value of each satellite combination state data set is calculated from the target neural network weight parameter using Equation 3:

    y_j = r_j                                          if s_{j+1} ends the task
    y_j = r_j + γ · max_{a'} Q(s_{j+1}, a'; θ')        otherwise                (3)

where y_j represents the target value, γ is the discount value, θ' is the target neural network weight parameter, max_{a'} Q(s_{j+1}, a'; θ') is the maximum Q value at the next moment after the first satellite or the second satellite in the satellite combination state data set executes a semi-major-axis control action, s_{j+1} represents the states of the first satellite and the second satellite at the next moment in the set, a represents the semi-major-axis control action executed by the first satellite or the second satellite at the current moment in the set, and r_j represents the reward in the set.
According to an exemplary embodiment of the present invention, in step S7, the loss function is Equation 4:

    L(θ) = (1/m) · Σ_{j=1}^{m} ( y_j − Q(s_j, a_j; θ) )²                        (4)

where y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the first satellite or the second satellite in the set executes the semi-major-axis control action a_j at the current moment, s_j represents the states of the first satellite and the second satellite at the current moment in the set, a_j represents the semi-major-axis control action executed at the current moment, and m is the number of satellite combination state data sets.
According to an exemplary embodiment of the present invention, in step S8, the Q value is updated according to the value function using Equation 5:

    Q(s_t, a_t) ← Q(s_t, a_t) + α · [ r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (5)

where the Q(s_t, a_t) on the left of the arrow represents the updated Q value after the first satellite or the second satellite at the current moment executes the semi-major-axis control action a_t; the Q(s_t, a_t) on the right of the arrow represents the corresponding Q value before the update; max_a Q(s_{t+1}, a) is the maximum pre-update Q value over the semi-major-axis control actions available at the next moment; α is the weight; γ is the discount value; s_t represents the states of the first satellite and the second satellite at the current moment; a_t represents the semi-major-axis control action executed by the first satellite or the second satellite at the current moment; s_{t+1} represents the states of the first satellite and the second satellite at the next moment; and r_t represents the reward. Time t is the current moment, and time t+1 is the moment following the current moment.
As a second aspect, the invention provides a satellite relative phase maintenance strategy model based on Nature DQN, established using the above modeling method of a satellite relative phase maintenance strategy model based on Nature DQN.
As a third aspect of the present invention, a method for obtaining a satellite relative phase maintaining optimal strategy is provided, wherein a satellite relative phase maintaining strategy model based on Nature DQN is established by using the modeling method of the satellite relative phase maintaining strategy model based on Nature DQN;
obtaining an optimal strategy according to the model;
the method for obtaining the optimal strategy according to the model adopts a formula 6:
Figure BDA0003937535660000051
wherein, pi represents the strategy of the first satellite or the second satellite for semi-major axis control, pi * Represents the optimal semi-major axis control strategy learned by the model, namely the situation that the states of the first satellite and the second satellite are s at the initial momentUnder the condition of passing through strategy pi * The semimajor axis of (a) yields the greatest return under control behavior a.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the above modeling method of a satellite relative phase preservation strategy model based on Nature DQN.
As a fifth aspect of the present invention, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the above modeling method of a satellite relative phase preservation strategy model based on Nature DQN.
The invention has the beneficial effects that:
according to the scheme, modeling is carried out through the neural network, deep reinforcement learning and decision making are carried out by utilizing the current state data of the first satellite and the second satellite, complex modeling is carried out without utilizing various perturbation forces received by the satellites in the orbital operation process, an optimal relative phase control strategy can be obtained, the consumption of satellite fuel can be reduced, and the method has important significance and value for practical aerospace application.
Drawings
Fig. 1 schematically shows a step diagram of a modeling method of a satellite relative phase preservation strategy model based on Nature DQN.
Fig. 2 schematically shows a block diagram of an electronic device.
FIG. 3 schematically shows a block diagram of a computer-readable medium.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the embodiments of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present application and are, therefore, not intended to limit the scope of the present application.
In this scheme, observation information is obtained from the environment using the strong perception capability of deep learning, and the expected return is obtained using the strong decision-making capability of reinforcement learning to evaluate the value of each action. The whole learning process can be described as follows: at a given moment, the satellite interacts with the flight environment to acquire observation information; the neural network maps the current state information to the corresponding action (control action); the environment reacts to this action, yielding the corresponding reward value and the next observation; and the complete interaction record is stored in the experience pool. By cycling through this process continuously, the optimal strategy for the objective is finally obtained.
The satellite described in this scheme is an MEO satellite. A medium-Earth-orbit (MEO) satellite is one whose orbit lies roughly 2,000-20,000 km above the Earth's surface. It is a non-geosynchronous satellite, used mainly to supplement and extend land mobile communication systems; combined organically with the ground public network, it enables global personal mobile communication, and it can also serve in a satellite navigation system. MEO satellites therefore offer great advantages for global personal mobile communication and satellite navigation. A medium-orbit satellite combines the advantages of geostationary and low-orbit satellites, achieving true global coverage and more effective frequency reuse.
Such satellites complete tasks including global communication, global navigation and global environmental monitoring, covering both polar regions, so that any place on Earth can be reached by a satellite at any time. A single satellite or a single ring of satellites is not enough for this; several satellite rings must be configured in a particular way to form a satellite network, i.e. a constellation. A satellite constellation is a set of satellites launched into orbit that can work together normally, usually a satellite network formed by several rings configured in a given pattern. The main satellite constellations include the GPS, GLONASS, Galileo and Beidou constellations.
The Deep Q-Network (DQN) algorithm is a deep reinforcement learning method combining deep learning with Q-learning. Because it integrates the advantages of reinforcement learning and deep learning, it is now widely applied in many fields.
Deep reinforcement learning, a recent research hotspot in artificial intelligence, combines deep learning with reinforcement learning and realizes direct control and decision-making from raw input to output through end-to-end learning. Deep learning, built on neural network structures, has strong perception of the environment but limited decision-control capability, while reinforcement learning has very strong behavior decision-making capability. Deep reinforcement learning therefore combines the perception of deep learning with the decision-making of reinforcement learning; the two complement each other, and a control strategy can be learned directly from high-dimensional raw data. Since the method was proposed, substantial breakthroughs have been achieved in many tasks that require perceiving high-dimensional raw input and making control decisions, and thanks to the end-to-end advantage of deep learning, deep reinforcement learning can tackle problems that are hard to model and hard to plan.
The original DQN algorithm uses the same network to calculate the target value and the current value: the target value is computed with the parameters of the Q network currently being trained, and that target value is in turn used to update the parameters of the same network, so the two depend on each other circularly, which hinders convergence. Compared with DQN, Nature DQN adds a target network; this dual-network structure weakens the dependency between the computed target Q value and the Q network parameters being updated and, by integrating the advantages of reinforcement learning and deep learning, greatly improves the stability of the DQN algorithm.
Nature DQN reduces the correlation between the computed target value and the current network parameters by using two independent but identically structured Q networks, one as the current Q network and the other as the target Q network. The target network is updated at fixed intervals by copying the weight parameters of the current network to it; because the dual-network structure keeps the target Q value unchanged for a period of time, the correlation between the computed target Q value and the current network parameters decreases, improving the convergence and stability of the algorithm.
As a first embodiment of the present invention, there is provided a method for modeling a satellite relative phase preserving strategy model based on Nature DQN, as shown in fig. 1, including the steps of:
s1: initializing a model, and acquiring a plurality of groups of satellite training state data sets, wherein each group of satellite training state data sets comprises states of a first satellite and a second satellite at an initial moment, a plurality of expected orbit control moments and expected orbit control times; the states of the first satellite and the second satellite include relative phase differences of the first satellite and the second satellite.
The input of the model is the states of the first satellite and the second satellite, and the output is the return value (Q value) after the first satellite or the second satellite executes the semimajor axis control action.
The method for initializing the model is as follows: define the loss function; initialize the experience pool with capacity N, the experience pool being used to store training samples; initialize the current neural network weight parameter θ and the target neural network weight parameter θ' of the network model, with θ' = θ. The input of the initialized network is the states s of the first satellite and the second satellite, and the computed network output is the return value Q after the first satellite or the second satellite executes a semi-major-axis control action.
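By way of illustration only, and not as part of the patent disclosure, the initialization just described can be sketched in PyTorch as follows. The layer sizes, the three-component state encoding and the discrete set of candidate semi-major-axis adjustments are assumptions made for the example; the patent does not specify them.

```python
import torch
import torch.nn as nn

# Assumed discrete action set: candidate semi-major-axis adjustments in metres.
SEMI_MAJOR_AXIS_ACTIONS = [-50.0, -10.0, 0.0, 10.0, 50.0]

class QNetwork(nn.Module):
    """Maps the joint state of the two satellites to one Q value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Assumed state encoding: (relative phase difference, relative semi-major-axis
# deviation, time to the next expected orbit control moment).
state_dim = 3
n_actions = len(SEMI_MAJOR_AXIS_ACTIONS)

current_net = QNetwork(state_dim, n_actions)           # weights theta
target_net = QNetwork(state_dim, n_actions)            # weights theta'
target_net.load_state_dict(current_net.state_dict())   # theta' = theta
target_net.eval()

loss_fn = nn.MSELoss()  # mean-squared error over the mini-batch (Equation 4)
```

Any discretization of the control action would serve; what matters is that the network emits one Q value per candidate semi-major-axis action, as the text above requires.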
The motion state of a satellite at a given moment can be represented by the six Keplerian orbital elements: semi-major axis a_1, eccentricity e, right ascension of the ascending node Ω, argument of perigee ω, orbit inclination i_o and mean anomaly M; the orbital phase angle of the satellite is λ = ω + M. That is, the motion state of the satellite can be written {a_1, e, i_o, Ω, ω, M}. The states of the first satellite and the second satellite are obtained from the motion states of the satellites, and they comprise the relative phase difference of the first satellite and the second satellite.
The relative phase difference of the first satellite and the second satellite is obtained from the satellite motion states as follows.

During constellation construction a satellite cannot enter the theoretical design orbit exactly. There is usually a deviation (hereinafter the orbit-injection deviation, mainly a deviation of the orbit semi-major axis) that also causes the satellite phase to drift from the design phase over the long term. This long-term change is

    δλ_1 = ( −3n/(2a_0) + ∂λ_1/∂a + ∂λ_2/∂a ) · Δa · (t − t_0)

where δλ_1 is the long-term phase change; Δa is the orbit semi-major-axis deviation, equal to the semi-major axis at time t minus the semi-major axis at the initial time; n is the average angular velocity of the satellite's motion, n = sqrt(GM/a_0³), with G the universal gravitational constant and M the mass of the Earth; λ_1 is the long-term phase drift rate of the satellite caused by the J2 perturbation; λ_2 is the long-term phase drift rate caused by luni-solar gravitation; a_0 is the initial orbit semi-major axis of the satellite; t is the current moment and t_0 the initial moment. The first term in the brackets, −3n/(2a_0), gives the long-term phase change caused by the change in the satellite's angular velocity due to the semi-major-axis deviation Δa; the second and third terms, ∂λ_1/∂a and ∂λ_2/∂a, are the parts caused by the long-term perturbation of the phase; they are about three orders of magnitude smaller than the first term and are generally negligible. The satellite orbit phase deviation produced by the J2 perturbation, the luni-solar gravitational perturbation and the orbit-injection deviation then evolves as
    δλ = −(3n/(2a_0)) · Δa · (t − t_0) + (λ_1 + λ_2) · (t − t_0)

where δλ is the deviation of the actual working phase of the satellite from the designed orbit under the two-body condition; Δa is the orbit semi-major-axis deviation, equal to the semi-major axis at time t minus the semi-major axis at the initial time; n is the average angular velocity of the satellite's motion, n = sqrt(GM/a_0³), with G the universal gravitational constant and M the mass of the Earth; λ_1 is the long-term phase drift rate caused by the J2 perturbation; λ_2 is the long-term phase drift rate caused by luni-solar gravitation; a_0 is the initial orbit semi-major axis of the satellite; t is the current moment (time t), and t_0 is the initial moment.
The J2 perturbation is the long-period variation of the orbital elements caused by the Earth's non-sphericity. The two-body condition refers to studying the dynamics of two celestial bodies, each treated as a point mass, under their mutual gravitational attraction.
As can be seen from the above formula, for a family of satellites with the same orbit altitude, eccentricity and orbit inclination, the main parts of the long-term phase drift caused by orbit perturbation are the same, so no significant relative phase change arises from them; the long-term phase drifts of the satellites differ because of the orbit-injection deviation. The objective of relative phase control is therefore to eliminate the initial semi-major-axis injection deviation.
For a first satellite (denoted i) and a second satellite (denoted j), let their semi-major axes be a_i and a_j and their phase angles λ_i and λ_j, with relative semi-major-axis deviation Δa_ij = Δa_i − Δa_j. From the above analysis, the relative phase change between the constellation satellites, i.e. the relative phase difference of the first satellite and the second satellite, is obtained from

    Δλ_ij = −(3n/(2a_0)) · (Δa_i − Δa_j) · (t − t_0) + (λ_{1,i} − λ_{1,j}) · (t − t_0) + (λ_{2,i} − λ_{2,j}) · (t − t_0)

where Δλ_ij is the relative phase difference of the first satellite and the second satellite; Δa_i is the orbit semi-major-axis deviation of the first satellite and Δa_j that of the second satellite, the relative semi-major-axis deviation of the two satellites being Δa_ij = Δa_i − Δa_j; t is the current moment (time t) and t_0 is the initial moment; λ_{1,i} and λ_{1,j} are the long-term phase drift rates of the first and second satellites caused by the J2 perturbation; λ_{2,i} and λ_{2,j} are the long-term phase drift rates of the first and second satellites caused by luni-solar gravitation; n is the average angular velocity of satellite motion, n = sqrt(GM/a_0³), with G the universal gravitational constant and M the mass of the Earth; a_0 is the initial orbit semi-major axis of the satellites (the average angular velocities of the first and second satellites are equal, and their initial orbit semi-major axes are equal); and the semi-major-axis deviation equals the semi-major axis at time t (the current moment) minus the initial orbit semi-major axis of the satellite.

In this formula Δλ_ij is the relative phase change of the first satellite with respect to the second satellite. Considering that the orbit semi-major axis, eccentricity and inclination of every satellite in the constellation are the same, the long-term phase drifts of the constellation satellites under orbit perturbation can be taken to be identical, and the formula simplifies further to

    Δλ_ij = −(3n/(2a_0)) · Δa_ij · (t − t_0)
the relative phase evolution of the constellation is mainly caused by the satellite orbit entering deviation, so that the relative phase maintenance can be realized by adjusting the semimajor axis of the satellite.
In summary, the relative phase difference of the first satellite and the second satellite at the next moment is obtained with Equation 2:

    Δλ_ij = −(3n/(2a_0)) · Δa_ij · (t_1 − t_0)                                  (2)

where Δλ_ij is the relative phase difference of the first satellite and the second satellite at the next moment; Δa_i is the orbit semi-major-axis deviation of the first satellite and Δa_j that of the second satellite, the relative semi-major-axis deviation of the two satellites being Δa_ij = Δa_i − Δa_j; t_1 is the moment following time t, and t_0 is the initial moment; n is the average angular velocity of satellite motion, n = sqrt(GM/a_0³), with G the universal gravitational constant and M the mass of the Earth; a_0 is the initial orbit semi-major axis of the satellites (the average angular velocities of the first and second satellites are equal, and their initial orbit semi-major axes are equal); and the semi-major-axis deviation equals the semi-major axis at the moment following time t minus the initial orbit semi-major axis of the satellite.
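A minimal sketch (not from the patent) of evaluating the simplified Equation 2 numerically; the Earth gravitational parameter GM is the standard value, while the example semi-major axis and injection deviations are assumed values chosen only to exercise the formula.

```python
import math

MU_EARTH = 3.986004418e14  # G*M of the Earth, m^3/s^2 (standard value)

def mean_motion(a0_m: float) -> float:
    """Average angular velocity n = sqrt(G*M / a0^3), in rad/s."""
    return math.sqrt(MU_EARTH / a0_m ** 3)

def relative_phase_drift(da_i: float, da_j: float, a0_m: float, dt_s: float) -> float:
    """Relative phase change (rad) of satellite i w.r.t. satellite j after dt_s
    seconds, per the simplified Equation 2:
        delta_lambda_ij = -(3 n / (2 a0)) * (da_i - da_j) * (t1 - t0)
    da_i, da_j: orbit-injection semi-major-axis deviations of the two satellites, m.
    """
    n = mean_motion(a0_m)
    return -1.5 * n / a0_m * (da_i - da_j) * dt_s

# Illustrative (assumed) values: two MEO satellites at a0 ~ 27,800 km whose
# injection deviations differ by 100 m drift apart over one day.
drift = relative_phase_drift(da_i=120.0, da_j=20.0, a0_m=2.78e7, dt_s=86400.0)
print(f"relative phase drift after one day: {math.degrees(drift):.6f} deg")
```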
The training data are composed of multiple satellite training state data sets; at least 100 sets of satellite state data are used, and the more satellite state data there are, the more accurate the result of model training.
The multiple satellite training state data sets are training-set data; simulation data may be used, or simulation data combined with real data. A timeline over a period contains many time points, the state of the satellites differs at each time point, and executing an orbit control strategy at different time points produces different effects. In this scheme, across the multiple satellite training state data sets, the satellite state at the initial moment of each set corresponds to one time point, and the time points corresponding to the initial moments of the different sets differ; that is, the initial moment of each satellite training state data set is different.
The orbit semi-major axis is one of orbit elements of the artificial satellite, and indicates the size of the orbit. When the instantaneous orbit is elliptical, the semimajor axis is half of the major axis; when the track is circular, the semi-major axis is the radius.
S2: input the states of the first satellite and the second satellite at the initial moment of one satellite training state data set into the model to obtain all semi-major-axis control behaviors after the initial moment and the corresponding output Q values.
The states of the first satellite and the second satellite at the current moment are denoted s_t. Time t is the current moment, and the moment following time t is time t+1. After the first satellite or the second satellite executes a semi-major-axis control action at the current moment, the states of the first satellite and the second satellite at the next moment, which form the input of the next network pass, are obtained as s_{t+1}. Since it is the relative phase of the first and second satellites that is adjusted, only one of the two satellites needs to control its semi-major axis.
S3: acquire the states of the first satellite and the second satellite at the current moment, and acquire the semi-major-axis control behavior executed by the first satellite or the second satellite according to the greedy strategy.
During the initial cycle, the states of the first satellite and the second satellite at the current moment are their states at the initial moment.
The method for acquiring the semi-major-axis control action executed by the first satellite or the second satellite according to the greedy strategy is: the first satellite or the second satellite randomly selects a semi-major-axis control behavior with a first specified probability, or executes the semi-major-axis control behavior corresponding to the maximum Q value with a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
If the first specified probability is greater than the second, the semi-major-axis control behavior is obtained by having the first satellite or the second satellite randomly select a control behavior with the first specified probability.

If the second specified probability is greater than the first, the behavior is obtained by having the first satellite or the second satellite execute the control behavior corresponding to the maximum Q value with the second specified probability.

If the two probabilities are equal, either method may be chosen: random selection with the first specified probability, or execution of the behavior corresponding to the maximum Q value with the second specified probability.
The greedy policy is an epsilon-greedy policy.
The first assigned probability is ε, which decreases as the number of iterations increases.
The semi-major-axis control action executed by the first satellite or the second satellite at the current moment is a_t.
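A sketch of the ε-greedy selection just described, reusing current_net and n_actions from the earlier initialization sketch; the linear decay schedule for ε is an assumption, since the patent states only that ε decreases as the number of iterations increases.

```python
import random
import torch

def select_action(current_net, state, epsilon: float) -> int:
    """Epsilon-greedy: with probability epsilon pick a random semi-major-axis
    action; otherwise pick the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = current_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

def epsilon_at(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
               decay_steps: int = 10_000) -> float:
    """Assumed linear decay of the first specified probability epsilon."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```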
S4: executing the semimajor axis control action to obtain the states of the first satellite and the second satellite at the next moment; and awarding according to the states and relative phase keeping strategy reward functions of the first satellite and the second satellite at the next moment.
When, during long-term on-orbit operation, the relative phase deviation of the i-th and j-th satellites in the constellation exceeds a threshold (|Δλ_ij| > Δλ_max), one of the satellites must be orbit-controlled: the interference of the various factors is eliminated by active control at the cost of fuel. Since the time spent adjusting satellite phase is small relative to the lifetime of a navigation satellite, the performance index only requires that fuel consumption over the whole process be minimal.
Assuming the satellite control frequency is fixed (i.e. orbit control is performed after each fixed interval), the present control quantity should ensure that the phases of the two satellites are still within the holding range at the next control while being as small as possible. The change of the semi-major axis at time t (the current moment) therefore determines the phase difference of the two satellites extrapolated to time t+1 (the next moment), and a reward strategy at time t is designed accordingly.
The relative phase keeping strategy reward function (the reward strategy at time t) is therefore Equation 1:

[Equation 1 is reproduced only as an image in the source publication; per the description below, it penalizes a control action when the extrapolated relative phase difference leaves the expected range.]

where r_t is the reward for the semi-major-axis control action performed by the first satellite or the second satellite at the current moment; Δλ_0 is the relative phase difference of the first satellite and the second satellite on the nominal orbit; Δλ_s is the holding threshold for the relative phase difference of the first satellite and the second satellite; Δλ_{t+1} is the relative phase difference of the first satellite and the second satellite at the next moment; |Δλ_{t+1} − Δλ_0| is the change, relative to the nominal (theoretical) orbit, of the relative phase difference at the next moment extrapolated from the current moment after the semi-major-axis control action is performed, i.e. the influence of the control action on the relative phase difference of the two satellites; a penalty is given when the relative phase difference at time t+1 (the moment following time t) is not within the expected range; t is the current moment, and t_0 is the expected orbit control moment closest to the current moment.
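Because Equation 1 survives only as an image, any executable form of the reward is necessarily a guess. The sketch below implements one piecewise reward consistent with the description (penalize an action when the extrapolated relative phase difference leaves the expected range, otherwise reward small deviation from nominal); it should not be read as the patent's actual formula.

```python
def phase_keeping_reward(dl_next: float, dl_nominal: float, dl_threshold: float) -> float:
    """Assumed piecewise reward (the patent's Equation 1 is an image; this
    functional form is a guess consistent with the surrounding text).
    dl_next:      relative phase difference extrapolated to the next moment
    dl_nominal:   relative phase difference on the nominal orbit
    dl_threshold: holding threshold for the relative phase difference
    """
    deviation = abs(dl_next - dl_nominal)
    if deviation > dl_threshold:
        return -1.0                         # penalty: outside the expected range
    return 1.0 - deviation / dl_threshold   # larger reward the closer to nominal
```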
S5: store the states of the first satellite and the second satellite at the current moment, the semi-major-axis control action executed by the first satellite or the second satellite, the reward, and the states of the first satellite and the second satellite at the next moment in the experience pool as one satellite combination state data set.
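The experience pool of step S5 behaves as a standard fixed-capacity replay buffer; a minimal sketch follows (the capacity value and the implementation details are assumptions, not specified by the patent).

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity buffer of (s_t, a_t, r_t, s_{t+1}, done) transitions."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest samples drop automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, m: int):
        """Draw a mini-batch of m satellite combination state data sets."""
        return random.sample(list(self.buffer), m)

    def __len__(self):
        return len(self.buffer)

pool = ExperiencePool(capacity=10_000)  # capacity N is a free parameter
```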
S6: and taking out a plurality of satellite combination state data sets from the experience pool, and calculating the target value of each satellite combination state data set according to the target neural network weight parameter.
The number of satellite combination state data sets taken out is m, where m is a natural number greater than 0 and smaller than the number of satellite training state data sets; these m sets form a small-batch (mini-batch) of satellite combination state data. The number taken out is determined by the number of satellite training state data sets.
The target value of each satellite combination state data set is calculated from the target neural network weight parameter with Equation 3:

    y_j = r_j                                          if s_{j+1} ends the task
    y_j = r_j + γ · max_{a'} Q(s_{j+1}, a'; θ')        otherwise                (3)

where y_j represents the target value, γ is the discount value (attenuation factor), θ' is the target neural network weight parameter, max_{a'} Q(s_{j+1}, a'; θ') is the maximum Q value at the next moment after the first satellite or the second satellite in the set executes a semi-major-axis control action, s_{j+1} represents the states of the first satellite and the second satellite at the next moment in the set, a represents the semi-major-axis control action executed at the current moment in the set, and r_j represents the reward in the set.

The task ends when the model converges or the iteration is complete: when s_{j+1} is a state at which the model has converged or the iteration is complete, y_j equals r_j; otherwise y_j equals r_j + γ · max_{a'} Q(s_{j+1}, a'; θ').
The condition for model convergence is that the error calculated by the loss function lies within a specified range. The condition for iteration completion is that all steps have been executed.
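In code, Equation 3 with its terminal/non-terminal split is the usual one-step bootstrapped target computed with the frozen weights θ'. A sketch reusing target_net from the initialization sketch, where a batch is a list of (state, action, reward, next_state, done) tuples as stored in the pool above (the done flag encodes convergence or iteration completion):

```python
import torch

def compute_targets(batch, target_net, gamma: float) -> torch.Tensor:
    """Equation 3: y_j = r_j if the task ends at s_{j+1},
    else y_j = r_j + gamma * max_a' Q(s_{j+1}, a'; theta')."""
    states, actions, rewards, next_states, dones = zip(*batch)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    with torch.no_grad():                       # theta' is frozen here
        max_next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * max_next_q
```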
S7: and calculating errors according to the loss function, and updating the current weight parameters of the neural network.
An error is also calculated based on the target value.
The loss function is Equation 4:

    L(θ) = (1/m) · Σ_{j=1}^{m} ( y_j − Q(s_j, a_j; θ) )²                        (4)

where y_j represents the target value; θ is the current neural network weight parameter; Q(s_j, a_j; θ) represents the Q value after the first satellite or the second satellite in the set executes the semi-major-axis control action a_j at the current moment; s_j represents the states of the first satellite and the second satellite at the current moment in the set; a_j represents the semi-major-axis control action executed by the first satellite or the second satellite at the current moment; r_j represents the reward in the set; and m is the number of satellite combination state data sets.
The error is the calculation result of the loss function using equation 4.
The current neural network weight parameters are updated by a Stochastic Gradient Descent (SGD) method.
Here r_t, a_t, s_t and s_{t+1} denote samples from the satellite training state data sets, while r_j, a_j, s_j and s_{j+1} denote samples from the experience pool.
Steps S5-S7 adjust the parameters of the model so that its computational accuracy improves.
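Steps S6 and S7 combine into one gradient step; a sketch reusing compute_targets, current_net, target_net and loss_fn from the sketches above (the learning rate is an assumed value: the patent names stochastic gradient descent but gives no hyperparameters).

```python
import torch

optimizer = torch.optim.SGD(current_net.parameters(), lr=1e-3)  # lr assumed

def train_step(batch, gamma: float = 0.99) -> float:
    """One S6-S7 update: targets from theta' (Equation 3), MSE loss over the
    mini-batch (Equation 4), stochastic-gradient step on the current theta."""
    states, actions, _, _, _ = zip(*batch)
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    y = compute_targets(batch, target_net, gamma)
    # Q(s_j, a_j; theta): Q value of the action actually taken in each sample.
    q = current_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q, y)        # (1/m) * sum_j (y_j - Q(s_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```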
S8: updating the Q value according to the value function; and taking the states of the first satellite and the second satellite at the next moment as the states of the first satellite and the second satellite at the current moment.
The Q value is updated according to the value function with Equation 5:

    Q(s_t, a_t) ← Q(s_t, a_t) + α · [ r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (5)

where the Q(s_t, a_t) on the left of the arrow represents the updated Q value after the first satellite or the second satellite at the current moment executes the semi-major-axis control action a_t; the Q(s_t, a_t) on the right of the arrow represents the Q value before the update; max_a Q(s_{t+1}, a) is the maximum pre-update Q value over the semi-major-axis control actions available at the moment following the current moment; α is the weight; γ is the discount value (attenuation factor); s_t represents the states of the first satellite and the second satellite at the current moment; a_t represents the semi-major-axis control action executed by the first satellite or the second satellite at the current moment; s_{t+1} represents the states of the first satellite and the second satellite at the next moment; and r_t represents the reward. Both α and γ lie between 0 and 1.
S9: repeat steps S3-S8; the number of times steps S3-S8 are executed equals the expected number of orbit controls in the satellite training state data set. After steps S3-S8 have been repeated for the specified number of iterations, update the target neural network weight parameter according to the current neural network weight parameter.

That is, once the specified number of iterations is complete, the target neural network weight parameter is set to the current neural network weight parameter.
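The periodic copy θ' ← θ is a one-liner; the synchronization interval below is an assumed value, since the patent leaves the specified iteration count as a free parameter.

```python
SYNC_EVERY = 100  # assumed interval: the "specified number of iterations"

def maybe_sync_target(step: int) -> None:
    """Every SYNC_EVERY repetitions of S3-S8, set theta' to theta."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(current_net.state_dict())
```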
S10: repeat steps S2-S9 until all the data of the satellite training state data sets have been input.
In this modeling method the state data of the first satellite and the second satellite serve as the input of the neural network model, and the generated return value serves as its output. Using the Nature DQN neural network, no complex model of the various perturbation forces acting on the satellites in orbit is required; deep reinforcement learning is applied directly for learning and decision-making. As an improvement on the DQN algorithm, the method is suited to training large-scale neural networks and greatly improves the stability of the DQN algorithm; it can reduce satellite fuel consumption and is of real significance and value for practical aerospace applications.
According to a second embodiment of the invention, the invention provides a satellite relative phase keeping strategy model based on the Nature DQN, and the model is established by adopting the modeling method of the satellite relative phase keeping strategy model based on the Nature DQN of the first embodiment.
According to a third specific embodiment of the invention, the invention provides a method for acquiring a satellite relative phase maintaining optimal strategy, which comprises the steps of establishing a satellite relative phase maintaining strategy model based on the Nature DQN by adopting the modeling method of the satellite relative phase maintaining strategy model based on the Nature DQN of the first embodiment;
and obtaining an optimal strategy according to the model.
The optimal strategy is obtained from the model with Equation 6:

    π*(s) = argmax_a Q(s, a)                                                    (6)

where π represents a strategy by which the first satellite or the second satellite controls its semi-major axis, and π* represents the optimal semi-major-axis control strategy learned by the model; that is, given that the states of the first satellite and the second satellite at the initial moment are s, the semi-major-axis control behavior a selected by strategy π* yields the greatest return.
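In code, Equation 6 amounts to acting greedily with respect to the trained current network; a sketch reusing current_net and SEMI_MAJOR_AXIS_ACTIONS from the earlier sketches:

```python
import torch

def optimal_action(current_net, state) -> float:
    """The learned policy pi* acts greedily on the trained Q network,
    returning the semi-major-axis adjustment (metres, assumed units)
    with the largest predicted return."""
    with torch.no_grad():
        q_values = current_net(torch.as_tensor(state, dtype=torch.float32))
    return SEMI_MAJOR_AXIS_ACTIONS[int(q_values.argmax().item())]
```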
According to a fourth specific embodiment of the present invention, there is provided an electronic device, as shown in fig. 2, where fig. 2 is a block diagram of an electronic device shown according to an exemplary embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 2. The electronic device 200 shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in FIG. 2, electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code that can be executed by the processing unit 210, such that the processing unit 210 performs the steps according to various exemplary embodiments of the present application described in the present specification. For example, the processing unit 210 may perform the steps as shown in fig. 1.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 can also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 230 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 200' (e.g. a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any device (e.g. a router, a modem, etc.) that enables the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g. a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware.
Thus, according to a fifth embodiment of the present invention, there is provided a computer readable medium. As shown in fig. 3, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
The computer-readable medium carries one or more programs which, when executed by a device, cause the device to carry out the functions described in the first embodiment.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus as described in the embodiments, or may be changed correspondingly so as to be located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions for causing a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present invention.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A modeling method of a satellite relative phase holding strategy model based on Nature DQN, characterized by comprising the following steps:
S1: initializing a model, and acquiring a plurality of groups of satellite training state data sets, wherein each group of satellite training state data includes the states of a first satellite and a second satellite at an initial moment, a plurality of expected orbit control moments, and an expected number of orbit control operations; the states of the first satellite and the second satellite include the relative phase difference of the first satellite and the second satellite;
S2: inputting the states of the first satellite and the second satellite of one group of satellite training state data at the initial moment into the model, to obtain all semi-major axis control behaviors after the initial moment and the corresponding output Q values;
S3: acquiring the states of the first satellite and the second satellite at the current moment, and obtaining a semi-major axis control behavior to be executed by the first satellite or the second satellite according to a greedy strategy;
S4: executing the semi-major axis control behavior to obtain the states of the first satellite and the second satellite at the next moment; obtaining a reward according to the states and relative phases of the first satellite and the second satellite at the next moment and a relative phase holding strategy reward function; the relative phase holding strategy reward function adopts formula 1:
[Formula 1 is reproduced in the source only as image FDA0003937535650000011; it defines the reward r_t in terms of the quantities below.]
wherein r_t is the reward for the semi-major axis control behavior executed by the first satellite or the second satellite at the current moment; Δλ_0 is the relative phase difference of the first satellite and the second satellite on the nominal orbit; Δλ_s is the holding threshold for the relative phase difference of the first satellite and the second satellite; Δλ_{t+1} is the relative phase difference of the first satellite and the second satellite at the next moment; |Δλ_{t+1} − Δλ_0| is the change, relative to the nominal orbit, of the relative phase difference of the first satellite and the second satellite at the next moment after the semi-major axis control behavior is executed at the current moment, i.e., the influence of that control behavior on the relative phase difference of the two satellites; t is the current moment; t_0 is the expected orbit control moment closest to the current moment;
S5: storing the states of the first satellite and the second satellite at the current moment, the semi-major axis control behavior executed by the first satellite or the second satellite, the reward, and the states of the first satellite and the second satellite at the next moment into an experience pool as one group of satellite combination state data;
S6: taking a plurality of groups of satellite combination state data out of the experience pool, and calculating the target value of each group of satellite combination state data according to the target neural network weight parameter;
S7: calculating an error according to a loss function, and updating the weight parameter of the current neural network;
S8: updating the Q value according to a value function; taking the states of the first satellite and the second satellite at the next moment as the states of the first satellite and the second satellite at the current moment;
S9: repeating steps S3-S8, wherein the number of times steps S3-S8 are executed equals the expected number of orbit control operations of the group of satellite training state data; after steps S3-S8 have been repeated for a specified number of iterations, updating the weight parameter of the target neural network according to the weight parameter of the current neural network;
S10: repeating steps S2-S9 until all data of the satellite training state data sets have been input.
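For orientation, the training procedure of claim 1 maps onto the standard Nature DQN loop: a current network, a target network, an experience pool, and periodic target synchronization. The sketch below is a minimal, non-authoritative illustration in Python; the network sizes, hyperparameters, state dimension, action set, and the `env_step`/`reward_fn` stubs are all assumptions for illustration, not taken from the patent (the exact reward is formula 1, whose piecewise form appears only as an image in the source).

```python
# A minimal sketch (assumed hyperparameters and stubs) of the Nature DQN
# training loop in steps S1-S10 of claim 1.
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM = 3    # assumed: e.g. relative phase difference plus drift terms
N_ACTIONS = 5    # assumed discrete set of semi-major axis control behaviors


def make_qnet() -> nn.Module:
    # Small fully-connected network mapping satellite states to one Q value per action.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                         nn.Linear(64, N_ACTIONS))


q_net = make_qnet()                       # current network, weights theta (S7)
target_net = make_qnet()                  # target network, weights theta' (S6, S9)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay: deque = deque(maxlen=10_000)      # experience pool (S5)

GAMMA, EPSILON, BATCH, SYNC_EVERY = 0.99, 0.1, 32, 100   # assumed values


def reward_fn(next_state):
    # Stub for formula (1); its piecewise form is an image in the source,
    # so only its inputs (relative phase drift vs. threshold) are mirrored here.
    raise NotImplementedError


def env_step(state, action):
    # Stub: propagate both satellites after the chosen semi-major axis control (S4).
    raise NotImplementedError


def train_episode(init_state, n_controls: int) -> None:
    state = torch.as_tensor(init_state, dtype=torch.float32)
    for step in range(1, n_controls + 1):             # S9: one pass per expected control
        if random.random() < EPSILON:                 # S3: greedy strategy (claim 3)
            action = random.randrange(N_ACTIONS)
        else:
            with torch.no_grad():
                action = int(q_net(state).argmax())
        next_state = torch.as_tensor(env_step(state, action), dtype=torch.float32)
        replay.append((state, action, reward_fn(next_state), next_state))   # S5
        if len(replay) >= BATCH:
            s, a, r, s1 = zip(*random.sample(replay, BATCH))     # S6: sample pool
            s, s1 = torch.stack(s), torch.stack(s1)
            r = torch.tensor(r, dtype=torch.float32)
            with torch.no_grad():                      # S6: targets from theta'
                y = r + GAMMA * target_net(s1).max(dim=1).values
            q = q_net(s).gather(1, torch.tensor(a).unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, y)        # S7: formula (4)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        state = next_state                             # S8: advance the state
        if step % SYNC_EVERY == 0:                     # S9: sync target network
            target_net.load_state_dict(q_net.state_dict())
```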
2. The modeling method of the satellite relative phase holding strategy model based on Nature DQN according to claim 1, wherein in step S3, in the first cycle, the states of the first satellite and the second satellite at the current moment are the states of the first satellite and the second satellite at the initial moment.
3. The modeling method of the satellite relative phase holding strategy model based on Nature DQN according to claim 1, wherein in step S3, the method of obtaining the semi-major axis control behavior executed by the first satellite or the second satellite according to the greedy strategy comprises: the first satellite or the second satellite randomly selects a semi-major axis control behavior with a first specified probability, or executes the semi-major axis control behavior corresponding to the maximum Q value with a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
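The selection rule of claim 3 is the familiar ε-greedy scheme; a minimal sketch, where `epsilon` stands in for the first specified probability (the name is ours, not the patent's):

```python
import random

import numpy as np


def select_action(q_values: np.ndarray, epsilon: float) -> int:
    # With the first specified probability epsilon: pick a random semi-major
    # axis control behavior; otherwise (probability 1 - epsilon) exploit the
    # behavior with the maximum Q value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```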
4. The modeling method of the satellite relative phase holding strategy model based on Nature DQN according to claim 1, wherein in step S6, the method for calculating the target value of each group of satellite combination state data according to the target neural network weight parameter uses formula 3:
y_j = r_j + γ · max_a Q(s_{j+1}, a; θ′)   (3)
wherein y_j represents the target value, γ is the discount value, θ′ is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ′) represents the maximum Q value obtainable at the next moment after the first satellite or the second satellite in the group of satellite combination state data executes a semi-major axis control behavior a, s_{j+1} represents the states of the first satellite and the second satellite at the next moment in the group of satellite combination state data, a represents the semi-major axis control behavior executed by the first satellite or the second satellite, and r_j represents the reward in the group of satellite combination state data.
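A short sketch of formula (3) as reconstructed above, assuming `q_next` holds the target network's per-action Q values for the next states (an assumed input shape, not specified by the patent):

```python
import numpy as np


def td_targets(rewards: np.ndarray, q_next: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    # y_j = r_j + gamma * max_a Q(s_{j+1}, a; theta'), where q_next has shape
    # (m, n_actions) and was computed with the target weights theta'.
    return rewards + gamma * q_next.max(axis=1)
```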
5. The modeling method of the satellite relative phase holding strategy model based on Nature DQN according to claim 1, wherein in step S7, the loss function is given by formula 4:
L(θ) = (1/m) Σ_{j=1}^{m} ( y_j − Q(s_j, a_j; θ) )²   (4)
wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the first satellite or the second satellite in the group of satellite combination state data executes the semi-major axis control behavior a_j at the current moment, s_j represents the states of the first satellite and the second satellite at the current moment in the group of satellite combination state data, a_j represents the semi-major axis control behavior executed by the first satellite or the second satellite at the current moment, and m is the number of groups of satellite combination state data.
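Formula (4) is a mean squared error over the m sampled groups; a one-function sketch:

```python
import numpy as np


def dqn_loss(y: np.ndarray, q_pred: np.ndarray) -> float:
    # L(theta) = (1/m) * sum_j (y_j - Q(s_j, a_j; theta))^2
    return float(np.mean((np.asarray(y) - np.asarray(q_pred)) ** 2))
```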
6. The modeling method of the satellite relative phase holding strategy model based on Nature DQN according to claim 1, wherein in step S8, the method for updating the Q value according to the value function adopts formula 5:
Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_t + γ · max Q(s_{t+1}, a_t) − Q(s_t, a_t) ]   (5)
wherein Q(s_t, a_t) on the left side of the arrow represents the updated Q value after the first satellite or the second satellite executes the semi-major axis control behavior a_t at the current moment; Q(s_t, a_t) on the right side of the arrow represents that Q value before the update; Q(s_{t+1}, a_t) represents the Q value, before the update, after the first satellite or the second satellite executes the semi-major axis control behavior a_t at the moment following the current moment; α is the weight; γ is the discount value; s_t represents the states of the first satellite and the second satellite at the current moment; a_t represents the semi-major axis control behavior executed by the first satellite or the second satellite at the current moment; s_{t+1} represents the states of the first satellite and the second satellite at the moment following the current moment; and r_t represents the reward.
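Read as a tabular rule, formula (5) is the classical Q-learning update; a minimal sketch with an assumed dictionary-backed Q table (states and behaviors assumed hashable for illustration):

```python
from collections import defaultdict

# Q[state][action] -> value
Q = defaultdict(lambda: defaultdict(float))


def q_update(s_t, a_t, r_t, s_next, alpha: float = 0.1, gamma: float = 0.99) -> None:
    # Q(s_t,a_t) <- Q(s_t,a_t) + alpha*[r_t + gamma*max Q(s_{t+1},.) - Q(s_t,a_t)]
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s_t][a_t] += alpha * (r_t + gamma * best_next - Q[s_t][a_t])
```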
7. A satellite relative phase holding strategy model based on Nature DQN, characterized in that the model is established using the modeling method of any one of claims 1-6.
8. A method for obtaining an optimal strategy for satellite relative phase holding, characterized in that a satellite relative phase holding strategy model based on Nature DQN is established according to the modeling method of any one of claims 1-6;
an optimal strategy is obtained according to the model;
the method for obtaining the optimal strategy according to the model adopts formula 6:
π* = arg max_π Q_π(s, a)   (6)
wherein π represents a strategy by which the satellite performs semi-major axis control, and π* represents the optimal semi-major axis control strategy learned by the model, i.e., given that the states of the first satellite and the second satellite at the initial moment are s, the semi-major axis control behavior a obtained through strategy π* yields the greatest return.
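In practice the learned strategy is read out greedily from the trained network; a minimal sketch, where `q_net` is assumed to be a callable returning one Q value per semi-major axis control behavior:

```python
import numpy as np


def optimal_action(q_net, state) -> int:
    # pi*: in states s, choose the control behavior a with the largest Q value,
    # i.e. the behavior the model expects to yield the greatest return.
    return int(np.argmax(q_net(state)))
```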
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1-6.
CN202211410211.3A 2022-11-10 2022-11-10 Modeling method, system and acquisition method of satellite relative phase maintaining strategy model Active CN115806062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211410211.3A CN115806062B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method of satellite relative phase maintaining strategy model

Publications (2)

Publication Number Publication Date
CN115806062A true CN115806062A (en) 2023-03-17
CN115806062B CN115806062B (en) 2023-05-09

Family

ID=85483371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211410211.3A Active CN115806062B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method of satellite relative phase maintaining strategy model

Country Status (1)

Country Link
CN (1) CN115806062B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2017101165A4 (en) * 2017-08-25 2017-11-02 Liu, Yichen MR Method of Structural Improvement of Game Training Deep Q-Network
CN112204580A (en) * 2018-03-27 2021-01-08 诺基亚通信公司 Method and apparatus for facilitating resource pairing using deep Q networks
US20210014872A1 (en) * 2018-03-27 2021-01-14 Nokia Solutions And Networks Oy Method and apparatus for facilitating resource pairing using a deep q-network
CN110378382A (en) * 2019-06-18 2019-10-25 华南师范大学 Novel quantization transaction system and its implementation based on deeply study
CN112173174A (en) * 2020-10-15 2021-01-05 中国西安卫星测控中心 MEO constellation phase control method
CN112946698A (en) * 2021-01-29 2021-06-11 合肥工业大学智能制造技术研究院 Satellite signal cycle slip detection method based on reinforcement learning

Also Published As

Publication number Publication date
CN115806062B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
Boain A-B-Cs of sun-synchronous orbit mission design
Abdelkhalik et al. Optimization of space orbits design for Earth orbiting missions
CN115795816B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
Lee et al. Design and implementation of the flight dynamics system for COMS satellite mission operations
CN115758707B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
CN115795817B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
Fraser Adaptive extended Kalman filtering strategies for autonomous relative navigation of formation flying spacecraft
Yang et al. A station-keeping control method for GEO spacecraft based on autonomous control architecture
CN115892516B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
CN115806062B (en) Modeling method, system and acquisition method of satellite relative phase maintaining strategy model
CN115806060B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
Golikov THEONA—a numerical-analytical theory of motion of artificial satellites of celestial bodies
CN115806061B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
Märtens et al. The fellowship of the Dyson ring: ACT&Friends’ results and methods for GTOC 11
CN115758705B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758704B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758706B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
Adams Theory and Applications of Gram-Scale Spacecraft
CN115865167B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115865166A (en) Modeling method, model and acquisition method of satellite north-south conservation strategy model
Namazyfard Computational exploration of the cislunar region and implications for debris mitigation
Zhang et al. GTOC11: Methods and results from the team of Harbin Institute of Technology
Wei et al. Redesign of high-precision reference orbit for interferometric SAR satellite with injection error
CN110543676B (en) Satellite cluster configuration reconstruction planning method and system based on agent model
Lee Multidisciplinary Optimization Approach for Design and Operation of Constrained and Complex-shaped Space Systems.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant