CN115865166A - Modeling method, model and acquisition method of satellite north-south conservation strategy model - Google Patents


Publication number
CN115865166A
Authority
CN
China
Prior art keywords: satellite, inclination, state data, inclination angle, value
Prior art date
Legal status
Granted
Application number
CN202211408049.1A
Other languages
Chinese (zh)
Other versions
CN115865166B (en)
Inventor
吴琳琳
吴新林
何镇武
吴凌根
陈倩茹
王丽颖
张琳娜
Current Assignee
Emposat Co Ltd
Original Assignee
Emposat Co Ltd
Priority date
Filing date
Publication date
Application filed by Emposat Co Ltd
Priority to CN202211408049.1A
Publication of CN115865166A
Application granted granted Critical
Publication of CN115865166B
Legal status: Active

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the field of aerospace and provides a modeling method, a model, and an acquisition method for a satellite north-south maintenance strategy model. The modeling method comprises the following steps. S1: initialize the model. S2: obtain the inclination control actions and Q values. S3: obtain the inclination control action executed by the satellite. S4: execute the inclination control action and obtain a reward. S5: store the satellite combination state data in an experience pool. S6: calculate the target values. S7: calculate the error according to the loss function and update the weight parameters of the current neural network. S8: update the Q value according to the value function. S9: repeat steps S3-S8, where the number of times steps S3-S8 are executed equals the expected orbit control times of the set of satellite training state data, and update the weight parameters of the target neural network after steps S3-S8 have been repeated for the specified number of iterations. S10: repeat steps S2-S9 until all of the satellite training state data have been input. The scheme can obtain an optimal decision strategy and reduce satellite fuel consumption.

Description

Modeling method, model and acquisition method of satellite north-south conservation strategy model
Technical Field
The invention relates to the technical field of aerospace, and in particular to a NatureDQN-based modeling method, model, acquisition method, device, and medium for a satellite north-south conservation strategy model.
Background
With the continuous development of human aerospace activities, more and more remote sensing satellites provide help for the daily life of people.
A GEO satellite is perturbed by luni-solar (sun and moon) gravitation and the Earth's non-spherical gravity field during operation, causing it to drift in the north-south (latitude) direction; north-south position maintenance (inclination keeping) control of GEO three-axis stabilized satellites therefore plays a vital role in the aerospace field.
In the prior art, a dynamics model of the orbital maneuver is established by analyzing the various perturbation forces acting on the satellite during orbital operation, such as the Earth's non-spherical shape, sun-moon gravitation, and solar radiation pressure, and short-term and long-term north-south keeping strategies are then formulated. This approach performs complex modeling of the various perturbation forces experienced by the satellite in orbit; however, because of the complexity of the forces in space and the uncertainty of the satellite's parameters, the satellite cannot be modeled accurately, the number of parameters is large, and the computation is complex, which in turn degrades the precision of satellite inclination control and may consume more fuel. Moreover, conventional reinforcement learning methods cannot handle the high dimensionality of the state space and the action space.
Therefore, there is an urgent need for a NatureDQN-based modeling method, model, acquisition method, device, and medium for a satellite north-south conservation strategy model that reduce the modeling difficulty and accurately compute the north-south keeping strategy.
Disclosure of Invention
The invention aims to provide a NatureDQN-based modeling method, model, acquisition method, device, and medium for a satellite north-south maintenance strategy model that maintain the north-south position of a GEO three-axis stabilized satellite without complex modeling of the space forces and the uncertain satellite parameters; by exploiting the strong behavior decision-making capability of reinforcement learning, an optimal decision strategy can be obtained and satellite fuel consumption can be reduced.
In order to solve the above technical problems, as a first aspect of the present invention, a method for modeling a satellite north-south conservation strategy model based on Nature DQN is provided, comprising the following steps:
S1: initializing a model, and acquiring a plurality of groups of satellite training state data, wherein each group of satellite training state data comprises an initial state of the satellite, a plurality of expected orbit control moments, and expected orbit control times; the initial state of each satellite comprises the satellite inclination at the initial time;
S2: inputting the initial-time satellite inclination of a group of satellite training state data into the model to obtain all inclination control actions after the initial time and the corresponding output Q values;
S3: acquiring the satellite inclination at the current time, and acquiring the inclination control action executed by the satellite according to a greedy strategy;
S4: executing the inclination control action to obtain the satellite inclination at the next time, and obtaining a reward according to the satellite inclination at the next time and the north-south maintenance strategy reward function; the north-south maintenance strategy reward function adopts formula 1:

[Formula 1 appears as an image in the original document and is not reproduced here; it defines the reward r_t in terms of the quantities below.]

wherein r_t is the reward for the inclination control action of the satellite at the current time; Δs_t is the inclination difference at the next time relative to the current time, Δs_t = s_{t+1} − s_0, where s_0 is the inclination of the nominal orbit and s_{t+1} is the satellite inclination at the next time; the satellite inclination error at the next time is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time;
S5: storing the satellite inclination at the current time, the inclination control action executed by the satellite, the reward, and the satellite inclination at the next time in an experience pool as a group of satellite combination state data;
S6: taking a plurality of satellite combination state data sets out of the experience pool, and calculating the target value of each satellite combination state data set according to the target neural network weight parameters;
S7: calculating an error according to the loss function, and updating the weight parameters of the current neural network;
S8: updating the Q value according to the value function, and taking the satellite inclination at the next time as the satellite inclination at the current time;
S9: repeating steps S3-S8, wherein the number of times steps S3-S8 are executed equals the expected orbit control times of the group of satellite training state data; after steps S3-S8 have been repeated for the specified number of iterations, updating the target neural network weight parameters according to the current neural network weight parameters;
S10: repeating steps S2-S9 until all of the satellite training state data have been input.
According to an exemplary embodiment of the invention, initializing the model in step S1 comprises defining a loss function.
According to an exemplary embodiment of the present invention, the input of the model is the satellite tilt angle, and the output of the model is the return value (Q value) after the satellite performs the tilt angle control action.
According to an exemplary embodiment of the present invention, in step S1, the satellite inclination is the two-dimensional inclination of the satellite on the orbit, obtained from the satellite orbit inclination and the right ascension of the ascending node:

s = (i_x, i_y);  i_x = i·cos Ω,  i_y = i·sin Ω;

wherein s represents the two-dimensional inclination of the satellite on the orbit, i represents the satellite orbit inclination, and Ω represents the right ascension of the ascending node.
According to an exemplary embodiment of the invention, the two-dimensional tilt angle is vector data.
According to an exemplary embodiment of the present invention, in step S3, during the first loop, the satellite inclination at the current time is the satellite inclination at the initial time.
According to an exemplary embodiment of the present invention, in step S3, the method for obtaining the tilt control behavior performed by the satellite according to the greedy policy includes: the satellite randomly selects the inclination angle control behavior at the next moment according to a first specified probability or executes the inclination angle control behavior corresponding to the maximum Q value according to a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
According to an exemplary embodiment of the present invention, in step S6, the target value of each satellite combination state data set is calculated from the target neural network weight parameters using formula 2:

y_j = r_j, if s_{j+1} is terminal (the model has converged or the iteration is complete);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise;   (2)

wherein y_j represents the target value, γ is the discount factor, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value of the satellite performing an inclination control action a at the next time in the group of satellite combination state data, s_{j+1} represents the satellite inclination at the next time in the group, a represents the inclination control action performed by the satellite at the current time in the group, and r_j represents the reward in the group.
According to an exemplary embodiment of the present invention, in step S7, the loss function adopts formula 3:

L(θ) = (1/m) · Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²   (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite at the current time in a group of satellite combination state data performs inclination control action a_j, s_j represents the satellite inclination at the current time in the group, a_j represents the inclination control action performed by the satellite, and m is the number of satellite combination state data sets.
According to an exemplary embodiment of the present invention, in step S8, the Q value is updated according to the value function using formula 4:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]   (4)

wherein Q(s_t, a_t) on the left of the arrow represents the Q value after the update for the satellite at the current time performing inclination control action a_t; Q(s_t, a_t) on the right of the arrow represents that Q value before the update; Q(s_{t+1}, a) represents, before the update, the Q value after the satellite performs an inclination control action at the time following the current time; α is the update weight, γ is the discount factor, s_t represents the satellite inclination at the current time, a_t represents the inclination control action performed by the satellite at the current time, s_{t+1} represents the satellite inclination at the next time, and r_t represents the reward.

t represents the current time, and t+1 represents the time following the current time.
As a second aspect of the present invention, a satellite north-south conservation strategy model based on Nature DQN is provided; the model is established by using the above modeling method of the satellite north-south conservation strategy model based on Nature DQN.
As a third aspect of the present invention, a method for obtaining a satellite north-south conservation optimal strategy is provided, wherein a satellite north-south conservation strategy model based on Nature DQN is established by using the modeling method of the satellite north-south conservation strategy model based on Nature DQN;
obtaining an optimal strategy according to the model;
the method for obtaining the optimal strategy according to the model adopts a formula 5:
Figure BDA0003937351390000041
wherein, pi represents the strategy of the satellite for controlling the inclination angle, pi * Represents the optimal inclination control strategy learned by the model, i.e. the satellite passes through the strategy pi under the condition that the satellite inclination at the initial moment is s * Produces the greatest return under the tilt control behavior a.
As a fourth aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for modeling the NatureDQN-based satellite north-south preserving policy model.
As a fifth aspect of the present invention, there is provided a computer readable medium, having stored thereon a computer program, which when executed by a processor, implements the method of modeling the NatureDQN-based satellite north-south preservation policy model.
The beneficial effects of the invention are:
according to the scheme, the modeling is carried out through the neural network, the current satellite inclination angle data is utilized to carry out deep reinforcement learning and decision making, complex modeling is not required to be carried out through various perturbation forces received by the satellite in the orbital operation process, an optimal north-south control strategy can be obtained, the consumption of satellite fuel can be reduced, and the method has important significance and value for practical aerospace application.
Drawings
Fig. 1 schematically shows a step diagram of a modeling method of a satellite north-south conservation strategy model based on NatureDQN.
Fig. 2 schematically shows a block diagram of an electronic device.
FIG. 3 schematically shows a block diagram of a computer-readable medium.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the embodiments of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the present concepts. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present application and are, therefore, not intended to limit the scope of the present application.
In this scheme, observation information is obtained from the environment based on the strong perception capability of deep learning, and the expected return value is obtained based on the strong decision-making capability of reinforcement learning to evaluate the value of each action. The entire learning process can be described as follows: at a given moment, the satellite interacts with the flight environment to acquire observation information, the current state information is mapped to a corresponding action (control action) through the neural network, the environment reacts to this action to produce the corresponding reward value and the next observation information, and the complete interaction record is stored in an experience pool. By continuously repeating this process, the optimal strategy for achieving the goal can finally be obtained.
The satellite described in this scheme is a GEO three-axis stabilized satellite. Geosynchronous orbit (GEO) refers to a circular orbit in which a satellite orbits the Earth at approximately 36,000 kilometers above the equator. Satellites in this orbit are called geosynchronous satellites, also known as geostationary satellites or stationary satellites, because their revolution is synchronized with the Earth's rotation and they therefore remain relatively stationary with respect to the ground. Three-axis stabilization means that the satellite does not spin and its body is stabilized in the X, Y, and Z directions; in other words, it maintains a fixed attitude relationship with the Earth.
The Deep Q Network (DQN) algorithm is a deep reinforcement learning algorithm that combines deep learning with Q-learning. It integrates the advantages of reinforcement learning and deep learning, and is therefore now widely applied in many fields.
As a new research hotspot in the field of artificial intelligence, deep reinforcement learning combines deep learning with reinforcement learning and realizes direct control and decision-making from raw input to output through end-to-end learning. Because deep learning is based on neural network structures, it has strong perception capability for the environment but lacks decision and control capability, whereas reinforcement learning has very strong behavior decision-making capability. Deep reinforcement learning therefore combines the perception capability of deep learning with the decision-making capability of reinforcement learning; the two complement each other, and a control strategy can be learned directly from high-dimensional raw data. Since deep reinforcement learning methods were proposed, substantial breakthroughs have been achieved in many tasks that require perceiving high-dimensional raw input data and making control decisions, and thanks to the end-to-end learning advantage of deep learning, deep reinforcement learning can address problems that are difficult to model and plan.
The original DQN algorithm uses the same network to calculate the target value and the current value; that is, the target value is computed with the parameters of the Q network currently being trained, and that target value is in turn used to update the parameters of the same network, so the two depend on each other cyclically, which is unfavorable for the convergence of the algorithm. Compared with DQN, Nature DQN adds a target network: the dual-network structure reduces the dependency between the calculation of the target Q value and the Q network parameters being updated, and it integrates the advantages of reinforcement learning and deep learning, thereby greatly improving the stability of the DQN algorithm.
Nature DQN reduces the correlation between the target values computed by the target network and the current network parameters by using two independent but structurally identical Q networks (one as the current Q network and the other as the target Q network). The target network is updated at regular intervals by copying the weight parameters of the current network to the target network; with this dual-network structure the target Q value remains unchanged for a period of time, which reduces the correlation between the computed target Q value and the current network parameters and improves the convergence and stability of the algorithm.
As a first embodiment of the present invention, there is provided a method for modeling a satellite north-south conservation strategy model based on Nature DQN, as shown in fig. 1, including the steps of:
s1: initializing a model, and acquiring a plurality of groups of satellite training state data sets, wherein each group of satellite training state data set comprises an initial state of a satellite, a plurality of expected orbit control moments and expected orbit control times; the initial state of each satellite includes an initial time satellite inclination.
The method for initializing the model comprises: defining a loss function; initializing the capacity of the experience pool to N, where the experience pool is used to store training samples; initializing the current neural network weight parameters θ and the target neural network weight parameters θ' of the network model, with θ' = θ; initializing the specified number of training iterations T1. The initial network input is the satellite inclination s, and the computed network output is the return value Q after the satellite performs an inclination control action.
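For concreteness, this initialization can be sketched in code. The following is a minimal PyTorch-style sketch, not the patent's implementation: the network architecture, the discretized action set size, the experience pool capacity N, and the iteration count T1 shown here are assumed, illustrative values.

```python
from collections import deque

import torch
import torch.nn as nn

STATE_DIM = 2            # two-dimensional inclination s = (ix, iy)
N_ACTIONS = 9            # assumed discretization of the inclination control action
REPLAY_CAPACITY = 10000  # experience pool capacity N (assumed)
T1 = 100                 # specified iteration count before syncing the target network (assumed)

def build_q_net() -> nn.Module:
    # Maps the inclination vector to one Q value per candidate control action.
    return nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, N_ACTIONS),
    )

q_net = build_q_net()                                # current network, weights theta
target_net = build_q_net()                           # target network, weights theta'
target_net.load_state_dict(q_net.state_dict())       # theta' = theta

replay_pool = deque(maxlen=REPLAY_CAPACITY)          # experience pool
loss_fn = nn.MSELoss()                               # loss function defined at initialization
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # SGD update used in step S7
```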
The motion state of a satellite at a given time can be represented by the six Keplerian orbital elements: semi-major axis, eccentricity, right ascension of the ascending node, argument of perigee, orbit inclination, and mean anomaly.
In the north-south maintenance strategy problem, the orbit inclination i drifts because the satellite is perturbed by luni-solar gravitation and the Earth's non-spherical gravity field during operation. The orientation of the satellite's orbital plane in space is generally described by two orbital elements: the inclination i and the right ascension of the ascending node Ω. However, when the inclination is small, the following orbit elements are used instead of i and Ω to avoid singularities:
i_x = i·cos Ω,  i_y = i·sin Ω.

The satellite inclination data is the two-dimensional inclination of the satellite on the orbit; this two-dimensional inclination is vector data, and the two-dimensional inclination vector of the satellite on the orbit can be expressed as:

s = (i_x, i_y).

Thus, the two-dimensional inclination of the satellite in orbit is obtained from the satellite orbit inclination and the right ascension of the ascending node:

s = (i_x, i_y);  i_x = i·cos Ω,  i_y = i·sin Ω;

wherein s represents the two-dimensional inclination of the satellite on the orbit, i represents the satellite orbit inclination, and Ω represents the right ascension of the ascending node.
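As a small illustration, the conversion from the orbital elements to this two-dimensional inclination state could look as follows; the function name and the example values are illustrative, and the cos/sin convention follows the standard small-inclination vector rather than the image in the original.

```python
import math

def inclination_vector(i_rad, raan_rad):
    """Two-dimensional inclination s = (ix, iy) from the orbit inclination i
    and the right ascension of the ascending node (RAAN) Omega, in radians."""
    return (i_rad * math.cos(raan_rad), i_rad * math.sin(raan_rad))

# Example: i = 0.05 deg, RAAN = 80 deg
s = inclination_vector(math.radians(0.05), math.radians(80.0))
```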
The data set consists of multiple groups of satellite training state data; at least 100 groups of satellite state data are used, and the more satellite state data there are, the more accurate the result trained by the model is.
The multiple groups of satellite training state data are training-set data; simulated data may be used, or simulated data may be combined with real data. The timeline within a time period contains multiple time points, the state of the satellite differs at each time point, and executing the orbit control strategy at different time points produces different effects. In this scheme, across the multiple groups of satellite training state data, the initial-time satellite inclination of each group corresponds to the satellite inclination at one time point, and the time points corresponding to the initial times of the groups differ; that is, the initial time of each group of satellite training state data is different.
S2: and inputting the initial time satellite inclination angle of a group of satellite training state data sets into the model to obtain all inclination angle control behaviors after the initial time and the corresponding output Q value.
After the satellite executes the inclination control action at the initial time, the satellite inclination at the next time is obtained. After the satellite executes the inclination control action at that next time, the satellite inclination at the following time is obtained. By analogy, the inclination control actions at a sequence of subsequent times are obtained.
S3: and acquiring the inclination angle of the satellite at the current moment, and acquiring the inclination angle control behavior executed by the satellite according to a greedy strategy.
In the first loop, the satellite inclination at the current time is the satellite inclination at the initial time.
The method for acquiring the inclination angle control action executed by the satellite according to the greedy strategy comprises the following steps: the satellite randomly selects the dip angle control behavior according to a first specified probability or executes the dip angle control behavior corresponding to the maximum Q value according to a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
If the first specified probability is greater than the second specified probability, the inclination control action executed by the satellite is obtained according to the greedy strategy as follows: the satellite randomly selects an inclination control action with the first specified probability;
if the second specified probability is greater than the first specified probability, the inclination control action executed by the satellite is obtained according to the greedy strategy as follows: the satellite executes the inclination control action corresponding to the maximum Q value with the second specified probability;
and if the first specified probability equals the second specified probability, either method is selected: the satellite randomly selects an inclination control action with the first specified probability, or executes the inclination control action corresponding to the maximum Q value with the second specified probability.
The greedy policy is an epsilon-greedy policy.
The first assigned probability is ε, which decreases as the number of iterations increases.
The inclination control action executed by the satellite at the current time is a_t.
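Continuing the initialization sketch above, the ε-greedy selection of step S3 can be illustrated as follows; the value of ε and its decay schedule are assumptions.

```python
import random

import torch

def select_action(state_xy, epsilon):
    """Epsilon-greedy: with probability epsilon (the first specified probability)
    pick a random inclination control action, otherwise (the second specified
    probability, 1 - epsilon) pick the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        q_values = q_net(torch.tensor(state_xy, dtype=torch.float32))
        return int(q_values.argmax().item())
```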
S4: executing the inclination control action to obtain the satellite inclination at the next time, and obtaining a reward according to the satellite inclination at the next time and the north-south maintenance strategy reward function.
The satellite inclination is maintained by defining an inclination keeping circle with radius Δr_s: the inclination is allowed to drift freely until it approaches the upper boundary of the keeping circle of radius Δr_s, at which point an inclination maneuver is executed so that the inclination vector jumps to the lower boundary inside the keeping circle; the direction of the inclination maneuver is essentially opposite to the luni-solar perturbation.
The goal of the satellite north-south maintenance strategy is to minimize fuel consumption while maintaining the orbit inclination, so the reward strategy is defined to account for both the inclination variation and the fuel consumption. The initial mass of the satellite is constant, and the total fuel consumption depends on the sum of the absolute values of the velocity increments of each control, which in turn depends on the sum of the absolute values of the inclination vector changes produced by each control.
Assuming that the satellite control frequency is fixed (i.e., orbit control is performed after each fixed time interval), the control amount is expected to keep the inclination within the keeping circle at the next control while being as small as possible; that is, the orbit inclination at the next time should not only lie within the inclination keeping circle but also be as close as possible to the nominal (theoretical) orbit. The change in inclination at a given time t determines the state of the orbit inclination extrapolated to the next time (the time following t). A reward strategy at time t is therefore designed. With the nominal orbit inclination s_0 and the inclination keeping circle radius Δr_s, the reward strategy at time t is the north-south maintenance strategy reward function, which adopts formula 1:

[Formula 1 appears as an image in the original document and is not reproduced here; it defines the reward r_t in terms of the quantities below.]

wherein r_t is the reward for the inclination control action of the satellite at the current time; Δs_t is the inclination difference at the next time relative to the current time, Δs_t = s_{t+1} − s_0, where s_0 is the inclination of the nominal orbit and s_{t+1} is the satellite inclination at the next time; the satellite inclination error at the next time is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time.

Time t+1 is the time following the current time t; the extrapolated inclination at time t+1 is the inclination at the next time.
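Because formula 1 is reproduced only as an image in the source, the sketch below is an illustrative stand-in rather than the patent's reward function: it merely has the qualitative shape described above, rewarding a small inclination error inside the keeping circle and a small control effort, and penalizing leaving the circle. All coefficients are assumptions.

```python
import math

def illustrative_reward(s_next, s_nominal, delta_r_s, control_magnitude):
    """Placeholder reward (NOT the patent's formula 1): small inclination error
    and small control effort are rewarded; leaving the keeping circle is penalized."""
    err = math.dist(s_next, s_nominal)        # |s_{t+1} - s_0|
    if err > delta_r_s:                       # outside the inclination keeping circle
        return -1.0
    # inside the circle: trade off inclination error against fuel (control) use
    return 1.0 - err / delta_r_s - 0.1 * control_magnitude
```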
S5: and storing the satellite inclination angle at the current moment, the inclination angle control action executed by the satellite, the reward and the satellite inclination angle at the next moment into an experience pool as a group of satellite combination state data sets.
S6: and taking out a plurality of satellite combination state data sets from the experience pool, and calculating the target value of each satellite combination state data set according to the target neural network weight parameter.
The number of the satellite combination state data sets is m, m is a natural number larger than 0, and m is smaller than the number of the satellite training state data sets. The m groups of satellite combination state data sets are small-batch satellite combination state data sets. The number of the satellite combination state data sets is determined according to the number of the satellite training state data sets.
The target value of each satellite combination state data set is calculated from the target neural network weight parameters using formula 2:

y_j = r_j, if s_{j+1} is terminal (the model has converged or the iteration is complete);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise;   (2)

wherein y_j represents the target value, γ is the discount factor, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value of the satellite performing an inclination control action a at the next time in the group of satellite combination state data, s_{j+1} represents the satellite inclination at the next time in the group, a represents the inclination control action performed by the satellite at the current time in the group, and r_j represents the reward in the group.

The task ends when the model converges or the iteration completes. When s_{j+1} corresponds to model convergence or iteration completion, y_j equals r_j; otherwise, y_j equals r_j + γ·max_a Q(s_{j+1}, a; θ').
The conditions for model convergence are: the error calculated by the loss function is within a specified range.
The conditions for the iteration to be completed are: all steps are executed.
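Continuing the sketches above, steps S5 and S6 amount to storing a transition and computing the targets of formula 2 from a sampled minibatch; the minibatch size m and the tuple layout of the experience pool are assumptions.

```python
import random

import torch

def store_transition(s_t, a_t, r_t, s_next, done):
    # Step S5: one satellite combination state data set into the experience pool.
    replay_pool.append((s_t, a_t, r_t, s_next, done))

def compute_targets(batch, gamma=0.99):
    # Step S6 / formula 2: y_j = r_j for terminal samples, otherwise
    # y_j = r_j + gamma * max_a Q(s_{j+1}, a; theta').
    _, _, r, s_next, done = zip(*batch)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * max_next_q

m = 32                                       # minibatch size (assumed)
if len(replay_pool) >= m:
    batch = random.sample(replay_pool, m)
    targets = compute_targets(batch)
```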
S7: and calculating errors according to the loss function, and updating the weight parameters of the current neural network.
The loss function uses formula 3:

L(θ) = (1/m) · Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²   (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite at the current time in a group of satellite combination state data performs inclination control action a_j, s_j represents the satellite inclination at the current time in the group, a_j represents the inclination control action performed by the satellite, and m is the number of satellite combination state data sets.

The error is the result of evaluating the loss function of formula 3.

The current neural network weight parameters are updated by stochastic gradient descent (SGD).

r_t, a_t, s_t, and s_{t+1} denote samples from the satellite training state data set, while r_j, a_j, s_j, and s_{j+1} denote samples from the experience pool.
S5-S7 adjust the parameters of the model, so that the calculation accuracy of the model can be higher.
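The loss of formula 3 and the stochastic-gradient update of step S7 then follow directly (continuing the sketches above; the minibatch layout is the assumed one).

```python
import torch

def update_current_network(batch, targets):
    # Step S7 / formula 3: mean squared error between the targets y_j and
    # Q(s_j, a_j; theta), followed by one stochastic gradient descent step.
    s, a, _, _, _ = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    q_sa = q_net(s).gather(1, a).squeeze(1)   # Q(s_j, a_j; theta)
    loss = loss_fn(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # updates theta by SGD
    return float(loss.item())
```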
S8: and updating the Q value according to the value function, and taking the satellite inclination at the next moment as the current satellite inclination.
The Q value is updated according to the value function using formula 4:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]   (4)

wherein Q(s_t, a_t) on the left of the arrow represents the Q value after the update for the satellite at the current time performing inclination control action a_t; Q(s_t, a_t) on the right of the arrow represents that Q value before the update; Q(s_{t+1}, a) represents, before the update, the Q value after the satellite performs an inclination control action at the time following the current time; α is the update weight, γ is the discount factor, s_t represents the satellite inclination at the current time, a_t represents the inclination control action performed by the satellite at the current time, s_{t+1} represents the satellite inclination at the next time, and r_t represents the reward.
Wherein both α and γ range between 0 and 1.
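As a scalar illustration of formula 4 (independent of the network sketches above; the values of α and γ here are assumed examples):

```python
def q_learning_update(q_sa, q_next_max, reward, alpha=0.1, gamma=0.99):
    # Formula 4: Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
    return q_sa + alpha * (reward + gamma * q_next_max - q_sa)
```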
S9: repeating steps S3-S8, wherein the number of times steps S3-S8 are executed equals the expected orbit control times of the set of satellite state data; after steps S3-S8 have been repeated for the specified number of iterations, the target neural network weight parameters are updated according to the current neural network weight parameters.
After the specified number of iterations T1 has been completed, the target neural network weight parameters are updated to the current neural network weight parameters. In this way, the correlation between the target values computed by the target network and the current network parameters is reduced. The target network is updated every fixed number of steps C by copying the weight parameters of the current network to the target network; with this dual-network structure the target Q value remains unchanged for a period of time, which reduces the correlation between the computed target Q value and the current network parameters and improves the convergence and stability of the algorithm.
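Continuing the initialization sketch above, this periodic target-network update reduces to copying the current weights every T1 (or C) repetitions; the counter handling is an assumption.

```python
step_count = 0

def maybe_sync_target_network():
    # Step S9: after the specified number of repetitions, set theta' <- theta.
    global step_count
    step_count += 1
    if step_count % T1 == 0:
        target_net.load_state_dict(q_net.state_dict())
```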
S10: repeating steps S2-S9 until all of the satellite training state data have been input.
According to this modeling method, satellite inclination data are used as the input of the neural network model and the generated return values as the output. A Nature DQN neural network is adopted, so complex modeling of the various perturbation forces acting on the satellite during orbital operation is not required, and deep reinforcement learning is used directly for learning and decision-making. The method improves on the DQN algorithm, is suitable for training large-scale neural networks, and greatly improves the stability of the DQN algorithm; an optimal north-south control strategy can be obtained and satellite fuel consumption can be reduced, which is of great significance and value for practical aerospace applications.
According to a second embodiment of the invention, the invention provides a satellite north-south conservation strategy model based on the Nature DQN, and the model is established by adopting the modeling method of the satellite north-south conservation strategy model based on the Nature DQN of the first embodiment.
According to a third specific embodiment of the invention, the invention provides a method for obtaining a satellite north-south conservation optimal strategy, which comprises the steps of establishing a satellite north-south conservation strategy model based on the Nature DQN by adopting the modeling method of the satellite north-south conservation strategy model based on the Nature DQN of the first embodiment;
and obtaining an optimal strategy according to the model.
The method for obtaining the optimal strategy according to the model adopts formula 5:

π* = arg max_π Q(s, a; π)   (5)

wherein π represents a satellite inclination control strategy and π* represents the optimal inclination control strategy learned by the model; that is, with the satellite inclination at the initial time equal to s, the satellite produces the greatest return under inclination control action a by following strategy π*.
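In practice, applying the learned optimal strategy of formula 5 amounts to acting greedily with respect to the trained Q network (continuing the earlier sketches).

```python
import torch

def optimal_action(state_xy):
    # pi*(s): the inclination control action with the largest learned Q value.
    with torch.no_grad():
        q_values = q_net(torch.tensor(state_xy, dtype=torch.float32))
    return int(q_values.argmax().item())
```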
According to a fourth embodiment of the present invention, there is provided an electronic device, as shown in fig. 2, and fig. 2 is a block diagram of an electronic device according to an exemplary embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 2. The electronic device 200 shown in fig. 2 is only an example, and should not bring any limitation to the functions and the application range of the embodiments of the present application.
As shown in FIG. 2, electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
Wherein the storage unit stores program code executable by the processing unit 210 to cause the processing unit 210 to perform the steps according to various exemplary embodiments of the present application described in the present specification. For example, the processing unit 210 may perform the steps as shown in fig. 1.
The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 200' (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 200 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware.
Thus, according to a fifth embodiment of the present invention, there is provided a computer readable medium. As shown in fig. 3, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the embodiment of the present invention.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
The computer-readable medium carries one or more programs which, when executed by a device, cause the device to implement the functions of the first embodiment.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be modified accordingly in one or more apparatuses unique from the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiment of the present invention.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A satellite north-south conservation strategy model modeling method based on Nature DQN is characterized by comprising the following steps:
S1: initializing a model, and acquiring a plurality of groups of satellite training state data, wherein each group of satellite training state data comprises an initial state of the satellite, a plurality of expected orbit control moments, and expected orbit control times; the initial state of each satellite comprises the satellite inclination at the initial time;
S2: inputting the initial-time satellite inclination of a group of satellite training state data into the model to obtain all inclination control actions after the initial time and the corresponding output Q values;
S3: acquiring the satellite inclination at the current time, and acquiring the inclination control action executed by the satellite according to a greedy strategy;
S4: executing the inclination control action to obtain the satellite inclination at the next time, and obtaining a reward according to the satellite inclination at the next time and the north-south maintenance strategy reward function; the north-south maintenance strategy reward function adopts formula 1:

[Formula 1 appears as an image in the original document and is not reproduced here; it defines the reward r_t in terms of the quantities below.]

wherein r_t is the reward for the inclination control action of the satellite at the current time; Δs_t is the inclination difference at the next time relative to the current time, Δs_t = s_{t+1} − s_0, where s_0 is the inclination of the nominal orbit and s_{t+1} is the satellite inclination at the next time; the satellite inclination error at the next time is |s_{t+1} − s_0|; t is the current time, and t_0 is the expected orbit control time closest to the current time;
S5: storing the satellite inclination at the current time, the inclination control action executed by the satellite, the reward, and the satellite inclination at the next time in an experience pool as a group of satellite combination state data;
S6: taking a plurality of satellite combination state data sets out of the experience pool, and calculating the target value of each satellite combination state data set according to the target neural network weight parameters;
S7: calculating an error according to the loss function, and updating the weight parameters of the current neural network;
S8: updating the Q value according to the value function, and taking the satellite inclination at the next time as the satellite inclination at the current time;
S9: repeating steps S3-S8, wherein the number of times steps S3-S8 are executed equals the expected orbit control times of the group of satellite training state data; after steps S3-S8 have been repeated for the specified number of iterations, updating the target neural network weight parameters according to the current neural network weight parameters;
S10: repeating steps S2-S9 until all of the satellite training state data have been input.
2. The modeling method of the satellite north-south conservation strategy model based on Nature DQN according to claim 1, wherein in step S1, the satellite inclination is the two-dimensional inclination of the satellite on the orbit, the two-dimensional inclination being obtained from the satellite orbit inclination and the right ascension of the ascending node:

s = (i_x, i_y);  i_x = i·cos Ω,  i_y = i·sin Ω;

wherein s represents the two-dimensional inclination of the satellite on the orbit, i represents the satellite orbit inclination, and Ω represents the right ascension of the ascending node.
3. The modeling method of satellite north-south conservation strategy model based on Nature DQN according to claim 1, wherein in step S3, the method of obtaining tilt control actions performed by a satellite according to greedy strategy includes: the satellite randomly selects the dip angle control behavior according to a first specified probability or executes the dip angle control behavior corresponding to the maximum Q value according to a second specified probability; the sum of the first specified probability and the second specified probability equals 1.
4. The modeling method of the satellite north-south conservation strategy model based on Nature DQN according to claim 1, wherein in step S6, the method for calculating the target value of each satellite combination state data set according to the target neural network weight parameters uses formula 2:

y_j = r_j, if s_{j+1} is terminal (the model has converged or the iteration is complete);
y_j = r_j + γ·max_a Q(s_{j+1}, a; θ'), otherwise;   (2)

wherein y_j represents the target value, γ is the discount factor, θ' is the target neural network weight parameter, max_a Q(s_{j+1}, a; θ') represents the maximum Q value of the satellite performing an inclination control action a at the next time in the group of satellite combination state data, s_{j+1} represents the satellite inclination at the next time in the group, a represents the inclination control action performed by the satellite at the current time in the group, and r_j represents the reward in the group.
5. The method for modeling a satellite north-south conservation strategy model based on Nature DQN according to claim 1, wherein in step S7, the loss function adopts formula 3:

L(θ) = (1/m) · Σ_{j=1}^{m} (y_j − Q(s_j, a_j; θ))²   (3)

wherein y_j represents the target value, θ is the current neural network weight parameter, Q(s_j, a_j; θ) represents the Q value after the satellite at the current time in a group of satellite combination state data performs inclination control action a_j, s_j represents the satellite inclination at the current time in the group, a_j represents the inclination control action performed by the satellite, and m is the number of satellite combination state data sets.
6. The method for modeling a satellite north-south conservation strategy model based on Nature DQN according to claim 1, wherein in step S8, the method for updating the Q value according to the value function adopts formula 4:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]   (4)

wherein Q(s_t, a_t) on the left of the arrow represents the Q value after the update for the satellite at the current time performing inclination control action a_t; Q(s_t, a_t) on the right of the arrow represents that Q value before the update; Q(s_{t+1}, a) represents, before the update, the Q value after the satellite performs an inclination control action at the time following the current time; α is the update weight, γ is the discount factor, s_t represents the satellite inclination at the current time, a_t represents the inclination control action performed by the satellite at the current time, s_{t+1} represents the satellite inclination at the next time, and r_t represents the reward.
7. A satellite north-south conservation strategy model based on Nature DQN, wherein the model is established using the modeling method of any one of claims 1-6.
8. A method for acquiring a satellite north-south maintaining optimal strategy, characterized in that a satellite north-south maintaining strategy model based on Nature DQN is established according to the modeling method of any one of claims 1-6;

an optimal strategy is obtained according to the model;

the method for obtaining the optimal strategy according to the model uses formula 5:

$\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$ (5)

wherein π represents the strategy used by the satellite for inclination angle control, and π* represents the optimal inclination angle control strategy learned by the model, i.e. with the satellite inclination s at the initial moment, following strategy π* produces the greatest return under the inclination angle control action a.
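A minimal sketch of reading out the optimal strategy π* from a trained network: for a given initial inclination, the action with the largest predicted Q value is returned. The single-feature state encoding and the returned action index are assumptions of the example.

```python
import torch

def optimal_inclination_action(current_net, satellite_inclination):
    """Greedy read-out of the learned strategy pi* for an initial inclination s."""
    state = torch.tensor([[satellite_inclination]], dtype=torch.float32)
    with torch.no_grad():
        q_values = current_net(state)           # Q(s, a) for every candidate action
    return int(q_values.argmax(dim=1).item())   # action with the greatest return
```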
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202211408049.1A 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model Active CN115865166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211408049.1A CN115865166B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model

Publications (2)

Publication Number Publication Date
CN115865166A true CN115865166A (en) 2023-03-28
CN115865166B CN115865166B (en) 2023-06-13

Family

ID=85663095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211408049.1A Active CN115865166B (en) 2022-11-10 2022-11-10 Modeling method, system and acquisition method for satellite north-south maintenance strategy model

Country Status (1)

Country Link
CN (1) CN115865166B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875244A (en) * 2018-06-29 2018-11-23 北京航空航天大学 A kind of orbit prediction accuracy improvements method based on random forest
CN110012516A (en) * 2019-03-28 2019-07-12 北京邮电大学 A kind of low orbit satellite routing policy method based on deeply study framework
US20200370423A1 (en) * 2019-05-20 2020-11-26 Schlumberger Technology Corporation Controller optimization via reinforcement learning on asset avatar
CN111240345A (en) * 2020-02-11 2020-06-05 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN114362810A (en) * 2022-01-11 2022-04-15 重庆邮电大学 Low-orbit satellite beam hopping optimization method based on migration depth reinforcement learning
CN114967453A (en) * 2022-05-25 2022-08-30 北京理工大学 Satellite east-west coordination state initial value estimation method based on neural network
CN114933028A (en) * 2022-07-21 2022-08-23 北京航天驭星科技有限公司 Dual-star-orbit control strategy control method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUCHEN LIU ET AL: "Mission planning for Earth observation satellite with competitive learning strategy", AEROSPACE SCIENCE AND TECHNOLOGY, vol. 118 *
彭映晗: "Research on intelligent access and resource allocation mechanisms for space LEO satellite networks" (in Chinese), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, vol. 2021, no. 5 *
殷浩: "Research on satellite beam management algorithms for NTN networks" (in Chinese), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, vol. 2021, no. 1 *

Also Published As

Publication number Publication date
CN115865166B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
Boain A-B-Cs of sun-synchronous orbit mission design
US20210292011A1 (en) Machine learning system and method for orbital trajectory planning
Zuo et al. A case learning-based differential evolution algorithm for global optimization of interplanetary trajectory design
Chan et al. Autonomous imaging and mapping of small bodies using deep reinforcement learning
JP7250221B2 (en) quasi-satellite orbit
Carrara An open source satellite attitude and orbit simulator toolbox for Matlab
Ocampo et al. Theoretical foundation of Copernicus: a unified system for trajectory design and optimization
CN115795816B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
Bonasera et al. Designing impulsive station-keeping maneuvers near a sun-earth l2 halo orbit via reinforcement learning
CN115758707B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
CN115795817B (en) Modeling method, system and acquisition method of satellite east-west maintenance strategy model
Guzzetti Coupled orbit-attitude mission design in the circular restricted three-body problem
Fraser Adaptive extended Kalman filtering strategies for autonomous relative navigation of formation flying spacecraft
CN115865166B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758705B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758704B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115758706B (en) Modeling method, model and acquisition method of satellite east-west maintenance strategy model
CN115865167B (en) Modeling method, system and acquisition method for satellite north-south maintenance strategy model
CN115892516B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
CN115806062B (en) Modeling method, system and acquisition method of satellite relative phase maintaining strategy model
CN115806060B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
CN115806061B (en) Modeling method, model and acquisition method of satellite relative phase maintaining strategy model
Lou et al. A consider unscented particle filter with genetic algorithm for UAV multi-source integrated navigation
Miñan et al. Manoeuvre planning algorithm for satellite formations using mean relative orbital elements
Bowen On-board orbit determination and 3-axis attitude determination for picosatellite applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant