CN115865166B - Modeling method, system and acquisition method for satellite north-south maintenance strategy model - Google Patents


Publication number
CN115865166B
CN115865166B (application CN202211408049.1A)
Authority
CN
China
Prior art keywords
satellite
inclination angle
state data
south
current
Prior art date
Legal status
Active
Application number
CN202211408049.1A
Other languages
Chinese (zh)
Other versions
CN115865166A (en)
Inventor
吴琳琳
吴新林
何镇武
吴凌根
陈倩茹
王丽颖
张琳娜
Current Assignee
Emposat Co Ltd
Original Assignee
Emposat Co Ltd
Priority date
Filing date
Publication date
Application filed by Emposat Co Ltd
Priority to CN202211408049.1A
Publication of CN115865166A
Application granted
Publication of CN115865166B

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the aerospace field and provides a modeling method, system and acquisition method for a satellite north-south maintenance strategy model. The modeling method comprises the following steps: S1: initializing a model; S2: obtaining inclination angle control behaviors and Q values; S3: acquiring the inclination angle control behavior to be executed by the satellite; S4: executing the inclination angle control behavior to obtain a reward; S5: storing the satellite combination state data group in an experience pool; S6: calculating a target value; S7: calculating an error according to the loss function and updating the weight parameters of the current neural network; S8: updating the Q value according to the value function; S9: repeating steps S3-S8, the number of repetitions being equal to the expected orbit control number of the group of satellite training state data; updating the weight parameters of the target neural network after every specified number of iterations of steps S3-S8; S10: repeating steps S2-S9 until all data of the satellite training state data set have been input. The scheme can obtain an optimal decision strategy and reduce the consumption of satellite fuel.

Description

Modeling method, system and acquisition method for satellite north-south maintenance strategy model
Technical Field
The invention relates to the technical field of aerospace, and in particular to a modeling method, system, acquisition method, device and medium for a satellite north-south maintenance strategy model based on Nature DQN.
Background
With the continuous development of human aerospace activities, more and more remote sensing satellites are supporting people's daily lives.
During its operation, a GEO satellite is perturbed by luni-solar gravitation and the Earth's non-spherical gravity field, so it drifts in the north-south (latitude) direction; controlling the north-south position maintenance (inclination angle maintenance) of GEO three-axis stabilized satellites therefore plays a vital role in the aerospace field.
The prior art first establishes a dynamic model of orbit maneuvers by analyzing the various perturbation forces acting on the satellite during orbital operation, such as the Earth's shape, luni-solar gravitation and solar radiation pressure, and then formulates short-term and long-term strategies for north-south maintenance. This approach requires complex modeling of the various perturbation forces; however, owing to the complexity of the space environment and the uncertainty of the satellite's own parameters, the satellite cannot be modeled accurately, the parameters are numerous and the calculation is complicated, which in turn degrades the accuracy of satellite inclination control and consumes more fuel. Moreover, existing reinforcement learning methods cannot handle very high-dimensional state and action spaces.
Therefore, there is a need for a modeling method, system, acquisition method, device and medium for a satellite north-south maintenance strategy model based on Nature DQN, which reduce modeling difficulty and accurately compute the north-south maintenance strategy.
Disclosure of Invention
The invention aims to provide a modeling method, system, acquisition method, device and medium for a Nature DQN-based satellite north-south maintenance strategy model which require no complex modeling for north-south position maintenance of a GEO three-axis stabilized satellite and need not account for the complexity of space forces or the uncertainty of the satellite's own parameters; exploiting the strong behavior decision-making capability of reinforcement learning, they can obtain an optimal decision strategy and reduce satellite fuel consumption.
In order to solve the above technical problems, as one aspect of the present invention, there is provided a modeling method of a satellite north-south maintenance policy model based on Nature DQN, comprising the steps of:
s1: initializing a model, and acquiring a plurality of groups of satellite training state data groups, wherein each group of satellite training state data groups comprises an initial state of a satellite, a plurality of expected orbit control moments and expected orbit control times; the initial state of each satellite comprises an initial moment satellite inclination angle;
s2: inputting the initial-time satellite inclination angles of a group of satellite training state data sets into the model to obtain all inclination angle control behaviors after the initial time and their corresponding output Q values;
s3: acquiring the inclination angle of a satellite at the current moment, and acquiring the inclination angle control behavior executed by the satellite according to a greedy strategy;
s4: executing the inclination angle control action to obtain the satellite inclination angle at the next moment; obtaining a reward according to the satellite inclination angle at the next moment and the north-south maintenance strategy reward function; the north-south maintenance strategy reward function uses equation (1):
[Equation (1) is rendered only as an image in the original; it expresses the reward r_t as a function of the inclination deviation Δs_t, the keeping-circle radius δr_s, the current time t and the nearest expected orbit control time t_0.]

wherein r_t is the reward obtained for the inclination angle control behavior of the satellite at the current moment; Δs_t is the inclination difference at the moment next to the current moment, Δs_t = s_{t+1} − s_0, where s_0 is the inclination angle of the nominal orbit and s_{t+1} is the satellite inclination angle at the moment next to the current moment; the magnitude of the inclination difference at the next moment is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time;
s5: the satellite inclination angle at the current moment, the inclination angle control behavior executed by the satellite, rewards and the satellite inclination angle at the next moment are used as a group of satellite combination state data sets to be stored in an experience pool;
s6: taking out a plurality of groups of satellite combination state data sets from the experience pool, and calculating a target value of each satellite combination state data set according to the target neural network weight parameters;
s7: calculating an error according to the loss function, and updating the weight parameter of the current neural network;
s8: updating the Q value according to the value function; taking the satellite inclination angle at the next moment as the satellite inclination angle at the current moment;
s9: repeating steps S3-S8, the number of times steps S3-S8 are performed being equal to the expected orbit control number of the set of satellite training state data sets; after each repetition of the steps S3-S8 of the appointed iteration times, updating the weight parameters of the target neural network according to the weight parameters of the current neural network;
s10: steps S2-S9 are repeated until all the data of the satellite training state data set has been entered.
According to an exemplary embodiment of the present invention, initializing the model includes defining a loss function in step S1.
According to an exemplary embodiment of the present invention, the input of the model is the satellite inclination angle, and the output of the model is the return value (Q value) after the satellite performs an inclination angle control action.
According to an exemplary embodiment of the present invention, in step S1, the satellite inclination angle is the two-dimensional inclination angle on orbit, obtained from the satellite orbit inclination angle and the right ascension of the ascending node:

s = (i_x, i_y);

i_x = i·cos Ω,  i_y = i·sin Ω;

where s represents the two-dimensional inclination angle of the satellite on orbit, i represents the orbit inclination angle of the satellite, and Ω represents the right ascension of the ascending node.
According to an exemplary embodiment of the present invention, the two-dimensional inclination angle is vector data.
According to an exemplary embodiment of the present invention, in step S3, in the first cycle the satellite inclination angle at the current moment is the initial-moment satellite inclination angle.
According to an exemplary embodiment of the present invention, in step S3, the method for obtaining the inclination angle control behavior executed by the satellite according to the greedy strategy comprises: the satellite randomly selects the inclination angle control behavior for the next moment with a first specified probability, or executes the inclination angle control behavior corresponding to the maximum Q value with a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
According to an exemplary embodiment of the present invention, in step S6, the method for calculating the target value of each satellite combination state data group according to the target neural network weight parameters uses formula (2):
y_j = r_j, if s_{j+1} is a terminal state (model converged or iteration completed);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise    (2)

wherein y_j represents the target value, γ is the discount value, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value after the satellite performs inclination angle control action a at the moment next to the current moment in a group of satellite combination state data, s_{j+1} represents the satellite inclination angle at the next moment in the group, a represents the inclination angle control action performed by the satellite at the current moment in the group, and r_j represents the reward in the group.
According to an exemplary embodiment of the present invention, in step S7, the loss function uses formula 3:
L(θ) = (1/m)·Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²    (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite performs inclination angle control behavior a_j at the current moment in a group of satellite combination state data, s_j represents the satellite inclination angle at the current moment in the group, a_j represents the inclination angle control behavior performed by the satellite, and m is the number of satellite combination state data groups.
According to an exemplary embodiment of the present invention, in step S8, the method for updating the Q value according to the value function uses formula 4:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max Q(s_{t+1}, a_t) − Q(s_t, a_t)]    (4)

wherein Q(s_t, a_t) on the left side of the arrow represents the updated Q value after the satellite performs inclination angle control behavior a_t at the current moment; Q(s_t, a_t) on the right side of the arrow represents that Q value before updating; Q(s_{t+1}, a_t) represents the Q value, before updating, after the satellite performs inclination angle control behavior a_t at the moment next to the current moment; α is the weight, γ is the discount value, s_t represents the satellite inclination angle at the current moment, a_t represents the inclination angle control behavior performed by the satellite at the current moment, s_{t+1} represents the satellite inclination angle at the moment next to the current moment, and r_t represents the reward.
t represents the current time, and t+1 represents the time next to the current time.
As a second aspect of the present invention, a satellite north-south maintenance strategy system based on Nature DQN is provided; the system is built using the above modeling method of the Nature DQN-based satellite north-south maintenance strategy model.
As a third aspect of the present invention, a method for obtaining a satellite north-south maintenance optimal strategy is provided: the above modeling method of the Nature DQN-based satellite north-south maintenance strategy model is used to build the Nature DQN-based satellite north-south maintenance strategy model;
obtaining an optimal strategy according to the model;
the method for obtaining the optimal strategy according to the model adopts a formula 5:
π* = argmax_a Q(s, a; θ)    (5)

wherein π represents a satellite inclination angle control strategy, and π* represents the optimal inclination angle control strategy learned by the model; that is, given that the initial-moment satellite inclination angle is s, the satellite obtains the greatest return through the inclination angle control behavior a given by strategy π*.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement a modeling method of the Nature DQN-based satellite north-south retention policy model.
As a fifth aspect of the present invention, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a modeling method of the Nature DQN-based satellite north-south maintenance policy model.
The beneficial effects of the invention are as follows:
according to the scheme, modeling is performed through the neural network, deep reinforcement learning and decision making are performed by utilizing the current satellite inclination angle data, complex modeling is not needed by utilizing various perturbation forces received by the satellite in the orbit running process, an optimal north-south control strategy can be obtained, and consumption of satellite fuel can be reduced, so that the method has important significance and value for practical aerospace application.
Drawings
Fig. 1 schematically shows a step diagram of a modeling method of a satellite north-south retention policy model based on Nature DQN.
Fig. 2 schematically shows a block diagram of an electronic device.
Fig. 3 schematically shows a block diagram of a computer readable medium.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another element. Thus, a first component discussed below could be termed a second component without departing from the teachings of the present application concept. As used herein, the term "and/or" includes any one of the associated listed items and all combinations of one or more.
Those skilled in the art will appreciate that the drawings are schematic representations of example embodiments, and that the modules or flows in the drawings are not necessarily required to practice the present application, and therefore, should not be taken to limit the scope of the present application.
The scheme obtains observation information from the environment based on the strong perception capability of deep learning, and evaluates the value of actions through the expected return based on the strong decision-making capability of reinforcement learning. The entire learning process can be described as follows: at a certain moment, the satellite interacts with the flight environment to acquire observation information; the current state information is mapped to a corresponding action (control behavior) through the neural network; the environment reacts to this action, returning a reward value and the next observation; and the complete interaction record is stored in the experience pool. By continuously cycling this process, an optimal strategy for achieving the objective is finally obtained.
The satellite in this scheme is a GEO three-axis stabilized satellite. Geosynchronous orbit (GEO) refers to a circular orbit around the Earth at approximately 36000 km above the equator. Because the satellite's orbital period is synchronized with the Earth's rotation, it remains relatively stationary with respect to the ground; such satellites are called geostationary satellites, also known as "stationary satellites" or "fixed satellites". Three-axis stabilization means the satellite does not spin: the body is stabilized in the X, Y and Z directions, in other words it maintains a fixed attitude relation with the Earth.
The Deep Q-Network (DQN) algorithm is a deep reinforcement learning network that combines deep learning with Q-learning. Because it unites the advantages of reinforcement learning and deep learning, it is now widely used in many fields.
Deep reinforcement learning is a new research hotspot in the field of artificial intelligence. It combines deep learning and reinforcement learning, realizing direct control and decision-making from raw input to output through end-to-end learning. Deep learning, being based on neural network structures, perceives the environment strongly but lacks decision-control capability, whereas reinforcement learning has very strong behavior decision-making capability. Deep reinforcement learning therefore combines the perception capability of deep learning with the decision capability of reinforcement learning in a complementary way and can learn control strategies directly from high-dimensional raw data. Since it was proposed, deep reinforcement learning has achieved substantial breakthroughs in many tasks that require perceiving high-dimensional raw input data and making control decisions, and its end-to-end learning advantage allows it to tackle problems that are hard to model and hard to plan.
The original DQN algorithm uses the same network to compute the target value and the current value: the target value is computed with the parameters of the Q-network currently being trained, and that target is then used to update the same network's parameters, so the two depend on each other cyclically, which hinders convergence. Compared with DQN, Nature DQN adds a target network: the double-network structure reduces the dependency between the target Q value computation and the Q-network parameters being updated, and by integrating the advantages of reinforcement learning and deep learning it greatly improves the stability of the DQN algorithm.
Nature DQN reduces the correlation between the target values computed by the target network and the current network parameters by using two independent but structurally identical Q-networks (one as the current Q-network, the other as the target Q-network). At intervals of a certain step length C, the current network updates the target network by copying its weight parameters to it; through this double-network structure the target Q value stays unchanged for a period of time, which reduces the correlation between the computed target Q value and the current network parameters and improves the convergence and stability of the algorithm.
As a first embodiment of the present invention, there is provided a modeling method of a satellite north-south maintenance policy model based on Nature DQN, as shown in fig. 1, comprising the steps of:
s1: initializing a model, and acquiring a plurality of groups of satellite training state data groups, wherein each group of satellite training state data groups comprises an initial state of a satellite, a plurality of expected orbit control moments and expected orbit control times; the initial state of each satellite includes an initial time satellite tilt angle.
The method for initializing the model comprises: defining a loss function; initializing the capacity of the experience pool to N, the experience pool being used for storing training samples; initializing the current neural network weight parameter θ and the target neural network weight parameter θ' of the network model with θ' = θ; initializing the specified number of training iterations T1; and initializing the network input as the satellite inclination angle s and the network output as the return value Q after the satellite performs an inclination angle control action.
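To make the initialization concrete, the following is a minimal Python/PyTorch sketch. The layer sizes and the values of N, T1 and the size of the discretized action set are illustrative assumptions, not values taken from the patent:

```python
import copy
from collections import deque

import torch.nn as nn

N = 10_000       # experience-pool capacity (illustrative assumption)
T1 = 100         # target-network update period in iterations (illustrative)
N_ACTIONS = 11   # discretized inclination-control actions (assumption)

# Current Q-network: input is the 2-D inclination s = (ix, iy),
# output is one return value Q per inclination-control action.
q_net = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
target_net = copy.deepcopy(q_net)   # θ' = θ at initialization
replay_pool = deque(maxlen=N)       # experience pool for training samples
loss_fn = nn.MSELoss()              # loss function of equation (3)
```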
The motion state of a satellite at a given moment can be represented by the six Keplerian orbital elements: semi-major axis, eccentricity, right ascension of the ascending node, argument of perigee, orbit inclination angle and mean anomaly.
In the north-south maintenance strategy, the orbit inclination i drifts under the influence of luni-solar gravitation and the Earth's non-spherical perturbation during satellite operation. The orientation of the satellite orbital plane in space is generally described by two orbital elements, the inclination i and the right ascension of the ascending node Ω. In the case of small inclination angles, however, the following orbital elements are used instead of i and Ω to avoid singularities:

i_x = i·cos Ω,  i_y = i·sin Ω.
the data of the satellite inclination angle is the two-dimensional inclination angle of the satellite on the orbit, the two-dimensional inclination angle is vector data, and the two-dimensional inclination angle vector of the satellite on the orbit can be expressed as:
s=(i x ,i y )。
thus, the two-dimensional tilt angle of a satellite on orbit is obtained from the satellite orbit tilt angle and the ascending intersection point, the right way:
s=(i x ,i y );
Figure GDA0004212552470000082
where s represents the two-dimensional tilt angle of the satellite in orbit, i represents the tilt angle of the satellite in orbit, and Ω represents the right ascent point.
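As a quick check of this representation, a small NumPy helper (assuming the standard small-inclination definition i_x = i·cos Ω, i_y = i·sin Ω, which matches the symbols above; the input values are illustrative) might look like:

```python
import numpy as np

def inclination_vector(i: float, raan: float) -> np.ndarray:
    """Two-dimensional inclination vector s = (ix, iy) from the orbit
    inclination i and the right ascension of the ascending node Ω (radians)."""
    return np.array([i * np.cos(raan), i * np.sin(raan)])

s = inclination_vector(np.radians(0.05), np.radians(80.0))  # illustrative values
```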
The satellite training state data groups form a data set containing at least 100 groups of satellite state data; the more satellite state data, the more accurate the model training result.

The data of the satellite training state data groups are training-set data and can be simulation data or a combination of simulation data and real data. A time line over a period contains multiple time points; the satellite's state differs at each time point, and executing the orbit control strategy at different time points yields different effects. In this scheme, across the multiple groups of satellite training state data, the initial-moment satellite inclination angle of each group corresponds to one time point, and the time points corresponding to the initial moments of the groups differ; that is, each group of satellite training state data has a different initial moment.
S2: inputting the initial-moment satellite inclination angles of a group of satellite training state data into the model to obtain all inclination angle control behaviors after the initial moment and their corresponding output Q values.
After the satellite at the initial moment executes an inclination angle control action, the satellite inclination angle at the next moment is obtained. After the satellite at that next moment executes an inclination angle control action, the inclination angle at the following moment is obtained, and so on, yielding the inclination angle control behaviors at the successive subsequent moments.
S3: and acquiring the inclination angle of the satellite at the current moment, and acquiring the inclination angle control behavior executed by the satellite according to a greedy strategy.
In the first cycle, the satellite inclination angle at the current moment is the satellite inclination angle at the initial moment.

The method for obtaining the inclination angle control behavior executed by the satellite according to the greedy strategy comprises: the satellite randomly selects an inclination angle control behavior with a first specified probability, or executes the inclination angle control behavior corresponding to the maximum Q value with a second specified probability; the sum of the first specified probability and the second specified probability equals 1.

If the first specified probability is greater than the second specified probability, the satellite randomly selects an inclination angle control behavior with the first specified probability;

if the second specified probability is greater than the first specified probability, the satellite executes the inclination angle control behavior corresponding to the maximum Q value with the second specified probability;

if the first specified probability equals the second specified probability, one of the two methods is selected: the satellite randomly selects an inclination angle control behavior with the first specified probability, or executes the inclination angle control behavior corresponding to the maximum Q value with the second specified probability.
The greedy strategy is the ε-greedy strategy.

The first specified probability is ε, which decreases as the number of iterations increases.

The inclination angle control behavior executed by the satellite at the current moment is denoted a_t.
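A minimal ε-greedy selector consistent with this step, written against the q_net defined in the initialization sketch above (the decay schedule in the comment is an assumption), might be:

```python
import random

import torch

def select_action(q_net, s, epsilon, n_actions):
    """ε-greedy: with the first specified probability ε pick a random
    inclination-control action; otherwise take the action with the max Q value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q = q_net(torch.as_tensor(s, dtype=torch.float32))
    return int(q.argmax().item())

# ε decreases as the number of iterations increases, e.g. (illustrative):
# epsilon = max(0.05, epsilon * 0.995)
```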
S4: executing the inclination angle control action to obtain the satellite inclination angle at the next moment, and obtaining a reward according to the satellite inclination angle at the next moment and the north-south maintenance strategy reward function.
The method of maintaining the satellite inclination angle is to set an inclination keeping circle of radius δr_s and allow the inclination to drift continuously until it approaches the upper boundary of the keeping circle, at which point an inclination maneuver is performed so that the inclination vector jumps to the lower boundary within the keeping circle; the inclination maneuver direction is essentially opposite to the luni-solar perturbation.
The goal of the satellite north-south maintenance strategy is to maintain the orbit inclination while consuming as little fuel as possible, so the reward strategy is defined to account for both inclination variation and fuel consumption. The satellite's initial mass is fixed, and the total fuel consumption depends on the sum of the absolute values of the velocity increments of each control, which in turn depends on the sum of the absolute values of the inclination-vector changes produced by each control.
Assuming the satellite control frequency is fixed (i.e., orbit control is performed after each fixed period of time), it is desirable that the current control amount ensures the inclination angle remains within the keeping circle at the next control while being as small as possible; that is, the orbit inclination at the next moment should not only lie within the inclination keeping circle but also be as close as possible to the nominal (theoretical) orbit. The change in the inclination angle at time t determines the extrapolated state of the orbit inclination at the moment next to time t. A reward strategy at time t is designed accordingly. With the nominal orbit inclination s_0 and the keeping-circle radius δr_s, the reward strategy at time t, i.e. the north-south maintenance strategy reward function, uses equation (1):

[Equation (1) is rendered only as an image in the original; it expresses the reward r_t as a function of the inclination deviation Δs_t, the keeping-circle radius δr_s, the current time t and the nearest expected orbit control time t_0.]

wherein r_t is the reward obtained for the inclination angle control behavior of the satellite at the current moment; Δs_t is the inclination difference at the moment next to the current moment, Δs_t = s_{t+1} − s_0, where s_0 is the inclination angle of the nominal orbit and s_{t+1} is the satellite inclination angle at the moment next to the current moment; the magnitude of the inclination difference at the next moment is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time.

Time t+1 is the moment next to time t (the current time); the extrapolated inclination angle at time t+1 is the inclination angle at the moment next to time t.
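Since equation (1) survives only as an image, the following NumPy sketch is an illustrative stand-in, not the patent's formula: it rewards controls that keep the extrapolated inclination close to the nominal orbit, penalizes leaving the keeping circle, and omits the dependence on t and t_0 that the symbol list indicates:

```python
import numpy as np

def reward_stand_in(s_next, s_nominal, dr_s, out_penalty=-10.0):
    """Illustrative reward only (NOT equation (1)): the closer the extrapolated
    inclination s_{t+1} stays to the nominal inclination s_0, the higher the
    reward; leaving the keeping circle of radius δr_s is penalized heavily."""
    dev = float(np.linalg.norm(np.asarray(s_next) - np.asarray(s_nominal)))
    return -dev if dev <= dr_s else out_penalty
```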
S5: storing the satellite inclination angle at the current moment, the inclination angle control action executed by the satellite, the reward and the satellite inclination angle at the next moment in the experience pool as a group of satellite combination state data.
S6: taking a plurality of groups of satellite combination state data out of the experience pool and calculating the target value of each group according to the target neural network weight parameters.

The number of satellite combination state data groups taken out is m, where m is a natural number greater than 0 and less than the number of satellite training state data groups. The m groups form a small batch (mini-batch) of satellite combination state data; the number taken is determined from the number of satellite training state data groups.
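Steps S5 and S6 map directly onto an experience-pool append and a random mini-batch draw; a sketch (the default mini-batch size m = 32 is an illustrative assumption):

```python
import random

def store_transition(pool, s_t, a_t, r_t, s_next, done):
    """S5: store one satellite combination state data group in the experience
    pool; `done` flags termination and feeds the target of equation (2)."""
    pool.append((s_t, a_t, r_t, s_next, done))

def fetch_minibatch(pool, m=32):
    """S6: take m groups of combination state data out of the experience pool."""
    batch = random.sample(pool, m)
    states, actions, rewards, next_states, dones = zip(*batch)
    return states, actions, rewards, next_states, dones
```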
The method for calculating the target value of each satellite combination state data set according to the target neural network weight parameters adopts a formula 2:
y_j = r_j, if s_{j+1} is a terminal state (model converged or iteration completed);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise    (2)

wherein y_j represents the target value, γ is the discount value, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value after the satellite performs inclination angle control action a at the moment next to the current moment in a group of satellite combination state data, s_{j+1} represents the satellite inclination angle at the next moment in the group, a represents the inclination angle control action performed by the satellite at the current moment in the group, and r_j represents the reward in the group.
The task stops when the model has converged or the iteration is completed. When s_{j+1} is a state in which the model has converged or the iteration is completed, y_j equals r_j; otherwise, y_j equals r_j + γ·max_a Q(s_{j+1}, a; θ').

The condition for model convergence is that the error calculated by the loss function falls within a specified range.

The condition for iteration completion is that all steps have been performed.
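The target computation of equation (2), including the terminal-state case, can be sketched in PyTorch as follows (the default discount value is illustrative):

```python
import numpy as np
import torch

def td_targets(target_net, rewards, next_states, dones, gamma=0.9):
    """Equation (2): y_j = r_j when s_{j+1} is terminal, otherwise
    y_j = r_j + γ · max_a Q(s_{j+1}, a; θ'), θ' being the target-network weights."""
    rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float32)
    dones = torch.as_tensor(np.asarray(dones), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values  # max_a Q(s_{j+1}, a; θ')
    return rewards + gamma * q_next * (1.0 - dones)
```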
S7: and calculating errors according to the loss function, and updating the weight parameters of the current neural network.
The loss function uses equation 3:
L(θ) = (1/m)·Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²    (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite performs inclination angle control behavior a_j at the current moment in a group of satellite combination state data, s_j represents the satellite inclination angle at the current moment in the group, a_j represents the inclination angle control behavior performed by the satellite, and m is the number of satellite combination state data groups.
The error is the result of the loss function calculation using equation (3).

The current neural network weight parameters are updated by stochastic gradient descent (SGD).

r_t, a_t, s_t and s_{t+1} denote samples from the satellite training state data set, while r_j, a_j, s_j and s_{j+1} denote samples drawn from the experience pool.
Steps S5-S7 adjust the parameters of the model so that the model's calculation accuracy becomes higher.
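One current-network update per equation (3), sketched with PyTorch's stochastic gradient descent (the learning rate is an illustrative assumption; states, actions and targets are assumed already converted to float/int64 tensors):

```python
import torch
import torch.nn.functional as F

def train_step(q_net, optimizer, states, actions, targets):
    """Equation (3): L(θ) = (1/m) Σ_j (y_j − Q(s_j, a_j; θ))², followed by one
    SGD step that updates the current network weights θ."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s_j, a_j; θ)
    loss = F.mse_loss(q_sa, targets)     # mean squared error over the m groups
    optimizer.zero_grad()
    loss.backward()                      # error back-propagation
    optimizer.step()                     # θ ← θ − lr·∇L(θ)
    return loss.item()

# optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # illustrative rate
```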
S8: and updating the Q value according to the value function, and taking the satellite inclination angle at the next moment as the current satellite inclination angle.
The method of updating the Q value according to the value function employs equation 4:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max Q(s_{t+1}, a_t) − Q(s_t, a_t)]    (4)

wherein Q(s_t, a_t) on the left side of the arrow represents the updated Q value after the satellite performs inclination angle control behavior a_t at the current moment; Q(s_t, a_t) on the right side of the arrow represents that Q value before updating; Q(s_{t+1}, a_t) represents the Q value, before updating, after the satellite performs inclination angle control behavior a_t at the moment next to the current moment; α is the weight, γ is the discount value, s_t represents the satellite inclination angle at the current moment, a_t represents the inclination angle control behavior performed by the satellite at the current moment, s_{t+1} represents the satellite inclination angle at the moment next to the current moment, and r_t represents the reward.
Both α and γ lie in the range 0 to 1.
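Equation (4) is the classic tabular Q-learning update; written out directly as a sketch (assuming hashable states, e.g. rounded inclination tuples, with illustrative default values for α and γ):

```python
def q_update(Q, s_t, a_t, r_t, s_next, actions, alpha=0.1, gamma=0.9):
    """Equation (4): Q(s_t,a_t) ← Q(s_t,a_t) + α[r_t + γ·max Q(s_{t+1},·) − Q(s_t,a_t)].
    Q is a dict mapping (state, action) pairs to Q values."""
    q_old = Q.get((s_t, a_t), 0.0)
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    Q[(s_t, a_t)] = q_old + alpha * (r_t + gamma * best_next - q_old)
```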
S9: repeating steps S3-S8, the number of times steps S3-S8 are performed being equal to the expected orbit control number of the group of satellite training state data; after each repetition of the specified number of iterations of steps S3-S8, the target neural network weight parameters are updated according to the current neural network weight parameters.
After every T1 specified iterations are completed, the target neural network weight parameters are updated to the current neural network weight parameters. In this way the correlation between the target values computed by the target network and the current network parameters is reduced: the current network copies its weight parameters to the target network at intervals of a certain step length C, and the double-network structure keeps the target Q value unchanged for a period of time, improving the convergence and stability of the algorithm.
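The periodic weight copy θ' ← θ is a one-line state-dict copy in PyTorch (a sketch reusing the names from the initialization sketch above):

```python
def maybe_sync_target(q_net, target_net, iteration, t1):
    """Every t1 iterations, copy the current-network weights θ into the
    target network (θ' ← θ); between copies the target Q values stay fixed."""
    if iteration % t1 == 0:
        target_net.load_state_dict(q_net.state_dict())
```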
S10: steps S2-S9 are repeated until all the data of the satellite training state data set has been entered.
In the above modeling method, satellite inclination angle data serves as the input of the neural network model and the generated return values serve as the output. By adopting the Nature DQN neural network, no complex modeling of the various perturbation forces acting on the satellite in orbit is needed; deep reinforcement learning is used directly for learning and decision-making. As an improvement of the DQN algorithm, the method is suitable for training large-scale neural networks, greatly improves the stability of the DQN algorithm, can obtain an optimal north-south control strategy, and can reduce satellite fuel consumption.
According to a second specific embodiment of the invention, the invention provides a satellite north-south maintenance strategy system based on Nature DQN, and the system is built by adopting the modeling method of the satellite north-south maintenance strategy model based on the Nature DQN of the first embodiment.
According to a third specific embodiment of the invention, the invention provides a method for acquiring a satellite north-south maintenance optimal strategy, and the method for modeling a satellite north-south maintenance strategy model based on Nature DQN of the first embodiment is adopted to build the satellite north-south maintenance strategy model based on the Nature DQN;
and obtaining an optimal strategy according to the model.
The method for obtaining the optimal strategy according to the model adopts the formula 5:
π* = argmax_a Q(s, a; θ)    (5)

wherein π represents a satellite inclination angle control strategy, and π* represents the optimal inclination angle control strategy obtained by model learning; that is, given that the initial-moment satellite inclination angle is s, the satellite obtains the greatest return through the inclination angle control behavior a given by strategy π*.
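Reading the optimal behavior out of the trained model per equation (5) is a single greedy evaluation; a sketch reusing q_net from the sketches above:

```python
import torch

def optimal_action(q_net, s):
    """Equation (5): for an initial inclination state s, the learned policy π*
    returns the control behavior a with the largest Q(s, a; θ)."""
    with torch.no_grad():
        q = q_net(torch.as_tensor(s, dtype=torch.float32))
    return int(q.argmax().item())
```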
According to a fourth embodiment of the present invention, an electronic device is provided, as shown in fig. 2, and fig. 2 is a block diagram of an electronic device according to an exemplary embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 2. The electronic device 200 shown in fig. 2 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments herein.
As shown in fig. 2, the electronic device 200 is in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting the different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code that is executable by the processing unit 210 such that the processing unit 210 performs the steps described in the present specification according to various exemplary embodiments of the present application. For example, the processing unit 210 may perform the steps as shown in fig. 1.
The memory unit 220 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 2201 and/or cache memory 2202, and may further include Read Only Memory (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 230 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 200' (e.g., keyboard, pointing device, bluetooth device, etc.), devices that enable a user to interact with the electronic device 200, and/or any devices (e.g., routers, modems, etc.) that the electronic device 200 can communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter 260. Network adapter 260 may communicate with other modules of electronic device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 200, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware.
Thus, according to a fifth embodiment of the present invention, the present invention provides a computer readable medium. As shown in fig. 3, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, or a network device, etc.) to perform the above-described method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The computer-readable medium carries one or more programs, which when executed by one of the devices, cause the computer-readable medium to implement the functions of the first embodiment.
Those skilled in the art will appreciate that the modules may be distributed throughout several devices as described in the embodiments, and that corresponding variations may be implemented in one or more devices that are unique to the embodiments. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A modeling method of a satellite north-south maintenance strategy model based on Nature DQN is characterized by comprising the following steps:
s1: initializing a model, and acquiring a plurality of groups of satellite training state data groups, wherein each group of satellite training state data groups comprises an initial state of a satellite, a plurality of expected orbit control moments and expected orbit control times; the initial state of each satellite comprises an initial moment satellite inclination angle;
s2: inputting the initial-time satellite inclination angles of a group of satellite training state data sets into the model to obtain all inclination angle control behaviors after the initial time and their corresponding Q values, wherein the Q values are the return values after the satellite executes the inclination angle control behaviors;
s3: acquiring the inclination angle of a satellite at the current moment, and acquiring the inclination angle control behavior executed by the satellite according to a greedy strategy;
s4: executing the inclination angle control action to obtain the satellite inclination angle at the next moment; obtaining a reward according to the satellite inclination angle at the next moment and the north-south maintenance strategy reward function; the north-south maintenance strategy reward function adopts formula (1):

[Equation (1) is rendered only as an image in the original; it expresses the reward r_t as a function of the inclination deviation Δs_t, the keeping-circle radius δr_s, the current time t and the nearest expected orbit control time t_0.]

wherein r_t is the reward obtained for the inclination angle control behavior of the satellite at the current moment; Δs_t is the inclination difference at the moment next to the current moment, Δs_t = s_{t+1} − s_0, where s_0 is the inclination angle of the nominal orbit and s_{t+1} is the satellite inclination angle at the moment next to the current moment; the magnitude of the inclination difference at the next moment is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time;
s5: the satellite inclination angle at the current moment, the inclination angle control behavior executed by the satellite, rewards and the satellite inclination angle at the next moment are used as a group of satellite combination state data sets to be stored in an experience pool;
s6: taking out a plurality of groups of satellite combination state data sets from the experience pool, and calculating a target value of each satellite combination state data set according to the target neural network weight parameters;
s7: calculating an error according to the loss function, and updating the weight parameter of the current neural network;
s8: updating the Q value according to the value function; taking the satellite inclination angle at the next moment as the satellite inclination angle at the current moment;
s9: repeating steps S3-S8, the number of times steps S3-S8 are performed being equal to the expected orbit control number of the set of satellite training state data sets; after each repetition of the steps S3-S8 of the appointed iteration times, updating the weight parameters of the target neural network according to the weight parameters of the current neural network;
s10: steps S2-S9 are repeated until all the data of the satellite training state data set has been entered.
2. The modeling method of a Nature DQN-based satellite north-south maintenance strategy model according to claim 1, wherein in step S1 the satellite inclination angle is the two-dimensional inclination angle of the satellite on orbit, obtained from the satellite orbit inclination angle and the right ascension of the ascending node:

s = (i_x, i_y);

i_x = i·cos Ω,  i_y = i·sin Ω;

where s represents the two-dimensional inclination angle of the satellite on orbit, i represents the orbit inclination angle of the satellite, and Ω represents the right ascension of the ascending node.
3. The modeling method of a Nature DQN-based satellite north-south maintenance strategy model according to claim 1, wherein in step S3 the method for obtaining the inclination angle control behavior performed by the satellite according to the greedy strategy comprises: the satellite randomly selects an inclination angle control behavior with a first specified probability, or executes the inclination angle control behavior corresponding to the maximum Q value with a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
4. The modeling method of a satellite north-south maintenance strategy model based on Nature DQN according to claim 1, wherein in step S6, the method of calculating the target value of each satellite combined state data set according to the target neural network weight parameter uses formula (2):
y_j = r_j, if s_{j+1} is a terminal state (model converged or iteration completed);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise    (2)

wherein y_j represents the target value, γ is the discount value, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value after the satellite performs inclination angle control action a at the moment next to the current moment in a group of satellite combination state data, s_{j+1} represents the satellite inclination angle at the next moment in the group, a represents the inclination angle control action performed by the satellite at the current moment in the group, and r_j represents the reward in the group.
5. The modeling method of a satellite north-south retention policy model based on Nature DQN according to claim 1, wherein in step S7, the loss function uses formula (3):
L(θ) = (1/m)·Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²    (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite performs inclination angle control behavior a_j at the current moment in a group of satellite combination state data, s_j represents the satellite inclination angle at the current moment in the group, a_j represents the inclination angle control behavior performed by the satellite, and m is the number of satellite combination state data groups.
6. The modeling method of a satellite north-south maintenance strategy model based on Nature DQN according to claim 1, wherein in step S8, the method of updating Q value according to a value function uses formula (4):
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max Q(s_{t+1}, a_t) − Q(s_t, a_t)]    (4);

wherein Q(s_t, a_t) on the left side of the arrow represents the updated Q value after the satellite performs inclination angle control behavior a_t at the current moment; Q(s_t, a_t) on the right side of the arrow represents that Q value before updating; Q(s_{t+1}, a_t) represents the Q value, before updating, after the satellite performs inclination angle control behavior a_t at the moment next to the current moment; α is the weight, γ is the discount value, s_t represents the satellite inclination angle at the current moment, a_t represents the inclination angle control behavior performed by the satellite at the current moment, s_{t+1} represents the satellite inclination angle at the moment next to the current moment, and r_t represents the reward.
7. A satellite north-south maintenance strategy system based on Nature DQN, characterized in that the system is built using the modeling method of any of claims 1-6.
8. A method for acquiring a satellite north-south maintenance optimal strategy, which is characterized in that a satellite north-south maintenance strategy model based on Nature DQN is established according to the modeling method of any one of claims 1-6;
obtaining an optimal strategy according to the model;
the method for obtaining the optimal strategy according to the model adopts a formula (5):
π* = argmax_a Q(s, a; θ)    (5)

wherein π represents a satellite inclination angle control strategy, and π* represents the optimal inclination angle control strategy learned by the model; that is, given that the initial-moment satellite inclination angle is s, the satellite obtains the greatest return through the inclination angle control behavior a given by strategy π*.
9. An electronic device, comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-6.
CN202211408049.1A 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model Active CN115865166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211408049.1A CN115865166B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211408049.1A CN115865166B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model

Publications (2)

Publication Number Publication Date
CN115865166A CN115865166A (en) 2023-03-28
CN115865166B true CN115865166B (en) 2023-06-13

Family

ID=85663095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211408049.1A Active CN115865166B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model

Country Status (1)

Country Link
CN (1) CN115865166B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111240345A (en) * 2020-02-11 2020-06-05 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875244B (en) * 2018-06-29 2020-05-12 北京航空航天大学 Orbit prediction precision improvement method based on random forest
CN110012516B (en) * 2019-03-28 2020-10-23 北京邮电大学 Low-orbit satellite routing strategy method based on deep reinforcement learning architecture
US11674384B2 (en) * 2019-05-20 2023-06-13 Schlumberger Technology Corporation Controller optimization via reinforcement learning on asset avatar
CN114362810B (en) * 2022-01-11 2023-07-21 重庆邮电大学 Low orbit satellite beam jump optimization method based on migration depth reinforcement learning
CN114967453A (en) * 2022-05-25 2022-08-30 北京理工大学 Satellite east-west coordination state initial value estimation method based on neural network
CN114933028B (en) * 2022-07-21 2022-11-11 北京航天驭星科技有限公司 Dual-star-orbit control strategy control method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111240345A (en) * 2020-02-11 2020-06-05 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework

Also Published As

Publication number Publication date
CN115865166A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
Sullivan et al. Using reinforcement learning to design a low-thrust approach into a periodic orbit in a multi-body system
Boain AB-Cs of sun-synchronous orbit mission design
Kiesbye et al. Hardware-in-the-loop and software-in-the-loop testing of the move-ii cubesat
CN115795816B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
CN115758707B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
CN115795817B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
Bonasera et al. Designing impulsive station-keeping maneuvers near a sun-earth l2 halo orbit via reinforcement learning
Erlank et al. Reliability analysis of multicellular system architectures for low-cost satellites
Guzzetti Coupled orbit-attitude mission design in the circular restricted three-body problem
CN115865166B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758704B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758705B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115865167B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758706B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
Frank Reflecting on planning models: A challenge for self-modeling systems
Fraser Adaptive extended Kalman filtering strategies for autonomous relative navigation of formation flying spacecraft
CN115806062B (en) Modeling method, system and acquisition method of satellite relative phase maintaining strategy model
CN115892516B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
Pan et al. Nonlinear dynamics of displaced non-Keplerian orbits with low-thrust propulsion
CN115806060B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
CN115806061B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
Namazyfard Computational exploration of the cislunar region and implications for debris mitigation
Wu et al. Trajectory optimization and maintenance for ascending from the surface of Phobos
LaFarge Reinforcement learning approaches for autonomous guidance and control in a low-thrust, multi-body dynamical environment
Bowen On-board orbit determination and 3-axis attitude determination for picosatellite applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant