CN113325704B - Spacecraft backlighting approaching intelligent orbit control method, device and storage medium - Google Patents

Spacecraft backlighting approaching intelligent orbit control method, device and storage medium

Info

Publication number
CN113325704B
CN113325704B (application CN202110450164.4A)
Authority
CN
China
Prior art keywords
spacecraft
training
target
network
self
Prior art date
Legal status
Active
Application number
CN202110450164.4A
Other languages
Chinese (zh)
Other versions
CN113325704A (en)
Inventor
袁利
黄煌
韩冬
石恒
魏春岭
Current Assignee
Beijing Institute of Control Engineering
Original Assignee
Beijing Institute of Control Engineering
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Control Engineering filed Critical Beijing Institute of Control Engineering
Priority to CN202110450164.4A
Publication of CN113325704A
Application granted
Publication of CN113325704B
Legal status: Active


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric
    • G05B 13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric, involving the use of models or simulators
    • G05B 13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

An embodiment of the invention provides an intelligent orbit control method for backlight approach of a spacecraft, comprising the following steps: establishing, in a simulation environment, kinematic models of the motion trajectory of the own spacecraft and the motion trajectory of the target spacecraft according to the Kepler orbit dynamics method; obtaining, from the kinematic models, observed quantities of the own spacecraft and the target spacecraft at time t0 and the speed increment of the target spacecraft at time t0; inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment; and judging, from the observed quantities and azimuth angles of the own spacecraft and the target spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment. The embodiment of the invention makes it possible to judge the backlight observation range between spacecraft.

Description

Spacecraft backlighting approaching intelligent orbit control method, device and storage medium
Technical Field
The invention relates to the technical field of spacecraft orbit control, in particular to a spacecraft backlighting approaching intelligent orbit control method, device and storage medium.
Background
In space missions it is necessary, on the one hand, to carry out close-range detailed inspection of space objects and, on the other hand, to protect the spacecraft itself and reduce its exposure as much as possible. How to achieve orbit control quickly and efficiently in such missions is a major technical problem today.
Disclosure of Invention
In view of the above technical problem, embodiments of the present invention provide an intelligent orbit control method, apparatus and storage medium for spacecraft backlighting approach, so as to achieve rapid backlight approach of a spacecraft.
The invention solves this technical problem by the following technical solution:
a spacecraft backlighting approaching intelligent orbit control method comprises the following steps:
establishing, in a simulation environment, kinematic models of the motion trajectory of the own spacecraft and the motion trajectory of the target spacecraft according to the Kepler orbit dynamics method;
obtaining, from the kinematic models, observed quantities of the own spacecraft and the target spacecraft at time t0 and the speed increment of the target spacecraft at time t0, wherein the observed quantities comprise: position information and speed information;
inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment;
inputting the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
judging, from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment at time t0, wherein the backlight observation range of the target spacecraft comprises: the own spacecraft being located between the target spacecraft and the sun, and the azimuth angle between the target spacecraft and the own spacecraft being smaller than a preset value.
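For illustration, the backlight-range judgment above can be written as a small numerical check. The following Python sketch is a minimal example assuming planar positions, a given sun azimuth unit vector and illustrative threshold values; the function name and defaults are not taken from the patent.

```python
import numpy as np

def in_backlight_range(p_target, p_own, sun_dir, max_distance=150.0, cos_threshold=0.75):
    """Return True if the own spacecraft lies in the target's backlight observation range.

    p_target, p_own : planar positions [x, y] in km.
    sun_dir         : sun azimuth unit vector S = [sx, sy].
    max_distance    : relative-distance bound D in km (illustrative value).
    cos_threshold   : bound on the cosine of the angle between S and the azimuth
                      vector A of the target relative to the own spacecraft,
                      i.e. the preset azimuth-angle limit (illustrative value).
    """
    rel = np.asarray(p_target, dtype=float) - np.asarray(p_own, dtype=float)
    d = float(np.linalg.norm(rel))
    if d == 0.0 or d > max_distance:
        return False
    a = rel / d                                   # azimuth of the target relative to the own spacecraft
    cos_angle = float(np.dot(np.asarray(sun_dir, dtype=float), a))
    return cos_angle > cos_threshold              # small angle between S and A
```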
The embodiment of the invention provides an intelligent reinforcement learning orbit control device for spacecraft backlighting, which comprises the following components:
the kinematic model building module is used for building a kinematic model of the motion trail of the spacecraft and the motion trail of the target spacecraft in a simulation environment according to a kepler orbit dynamics method;
the speed increment obtaining module is used for obtaining observed quantity of the own spacecraft and the target spacecraft at the time t0 and speed increment of the target spacecraft at the time t0 from the kinematic model;
the action network calculation module is used for inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment;
the observed quantity calculation module is used for inputting the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
the backlight observation judging module is used for judging whether the own spacecraft is in the backlight observation range of the target spacecraft after orbit control according to the speed increment at the time T0 according to the observed quantity of the own spacecraft and the target spacecraft at the time t0+T, the solar azimuth angle at the time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at the time t0+T, wherein the backlight observation range of the target spacecraft comprises: the self spacecraft is positioned between the target spacecraft and the sun, and the azimuth angle between the target spacecraft and the self spacecraft is smaller than a preset value.
An embodiment of the present invention provides a nonvolatile storage medium including: a software program which, when executed, performs the above method.
In the embodiment of the invention, the speed increments of the own spacecraft and the target spacecraft at time t0 are obtained, orbit control is applied to the own spacecraft and to the target spacecraft in the simulation environment according to their respective speed increments, the distance between the own spacecraft and the target spacecraft is obtained after one control period T, i.e. at time t0+T, and whether the own spacecraft is within the backlight observation range of the target spacecraft is judged from this distance. In the prior art, the spacecraft must interact with a ground communication system to obtain the observed quantity and speed increment of the target spacecraft, which introduces a time delay. During this delay the observed quantities and speed increments of both the own spacecraft and the target spacecraft change, so the quantities obtained are inaccurate, and orbit control and evaluation performed with these inaccurate quantities cannot achieve backlight approach of the spacecraft. With the technical scheme provided by the embodiment of the invention, the observed quantities and speed increments of the own spacecraft and the target spacecraft can be acquired on orbit in real time without such a delay, so backlight approach of the spacecraft can be achieved quickly and accurately.
Drawings
FIG. 1 is a schematic flow chart of a method for controlling a back-light approach intelligent orbit of a spacecraft provided by an embodiment of the invention;
fig. 2 is a schematic structural diagram of a spacecraft backlight approaching intelligent orbit control device according to an embodiment of the invention;
FIG. 3 is a graph depicting the relationship between distance and the reward function value during backlight approach provided by an embodiment of the invention;
FIG. 4 is a graph depicting the relationship between azimuth angle and the reward function value during backlight approach provided by an embodiment of the invention;
FIG. 5 is a graph comparing the convergence curves of the cumulative reward values for different learning parameters according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a trajectory of a spacecraft approaching from a backlight direction provided by an embodiment of the invention;
Detailed Description
In an embodiment of the present invention, for an actively maneuvering target, if the relative position information of the target must be downlinked to the ground over a communication link and the orbital maneuver strategy computed on the ground and then uplinked to the satellite, it is obviously difficult to cope with scenarios with demanding real-time requirements. It is therefore necessary to design orbital maneuver strategies that can be computed autonomously on orbit without ground intervention. Accordingly, the embodiment of the invention provides a reinforcement-learning intelligent orbit control method for backlight approach of a spacecraft: by establishing and training an action network and an evaluation network for the spacecraft, the speed increment of the spacecraft is computed online in real time from the relative position and relative velocity between the spacecraft, so that the backlight angle is maintained continuously while approaching the target; in this way the appearance information of the target is obtained under favourable illumination conditions while the appearance information of the own spacecraft is kept from being exposed.
Fig. 1 is a schematic flow chart of a method for controlling a backlight approaching intelligent orbit of a spacecraft, which is provided by the embodiment of the invention, and as shown in fig. 1, the method comprises the following steps:
step 101, establishing a kinematic model of a motion track of a spacecraft and a motion track of a target spacecraft in a simulation environment according to a kepler orbit dynamics method;
step 102, obtaining observed quantities of the own spacecraft and the target spacecraft at a time t0 and speed increment of the target spacecraft at the time t0 from the kinematic model, wherein the observed quantities comprise: position information and speed information;
step 103, inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment;
step 104, inputting the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
step 105, judging, from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment at time t0, wherein the backlight observation range of the target spacecraft comprises: the own spacecraft being located between the target spacecraft and the sun, and the azimuth angle between the target spacecraft and the own spacecraft being smaller than a preset value.
In the embodiment of the invention, the method is used in the spacecraft backlight-approach evaluation process. The observed quantities of the own spacecraft and the target spacecraft can be obtained from the kinematic models of their motion trajectories established in the simulation environment according to the Kepler orbit dynamics method; the observed quantity of the own spacecraft is input into its action network whose training effect has converged, the speed increment at time t0 is calculated, and the orbit of the own spacecraft is controlled according to this speed increment. With the technical scheme of the invention, backlight approach of the spacecraft can be achieved quickly and accurately.
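The kinematic models mentioned above are built from Kepler (two-body) orbit dynamics. A minimal planar propagation sketch is given below, assuming an impulsive speed increment followed by a single fixed-step RK4 integration over one control period; the integrator choice and the helper names are assumptions for illustration, since the patent does not specify them.

```python
import numpy as np

MU_EARTH = 398600.4418  # km^3/s^2, Earth's gravitational parameter

def two_body_deriv(state):
    """Time derivative of the planar two-body state [x, y, vx, vy] (km, km/s)."""
    r = state[:2]
    accel = -MU_EARTH * r / np.linalg.norm(r) ** 3
    return np.concatenate([state[2:], accel])

def propagate(state, dt, dv=(0.0, 0.0)):
    """Apply an impulsive velocity increment dv, then propagate one control
    period dt with a single RK4 step (a real simulation would sub-step)."""
    s = np.asarray(state, dtype=float).copy()
    s[2:] += np.asarray(dv, dtype=float)
    k1 = two_body_deriv(s)
    k2 = two_body_deriv(s + 0.5 * dt * k1)
    k3 = two_body_deriv(s + 0.5 * dt * k2)
    k4 = two_body_deriv(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```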
According to the technical scheme, a large number of samples are used for learning and training: the initial action network and the initial evaluation network of the own spacecraft are trained and adjusted in the simulation environment to obtain an action network and an evaluation network whose training effect has converged, and the speed increment of the own spacecraft can then be obtained from the trained action network. The instant reward function at each moment is clipped and designed in a graded manner, and according to the observed quantities at multiple moments, the speed increments, the instant reward function, the learning rate of the MADDPG algorithm, the long-term return discount factor, the number of training iterations per round and the batch-learning data quantity, the initial action network and the initial evaluation network of the own spacecraft are trained and adjusted in the simulation environment until an action network and an evaluation network whose weight update amount is smaller than a preset value are obtained. Whether the own spacecraft is within the backlight observation range of the target spacecraft can then be judged from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T, and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T. In the embodiment of the invention, the method for judging that the weight update amount is smaller than the preset value is as follows: the expected output values and the actual output values of the initial action network and the initial evaluation network are acquired continuously at a plurality of moments, the expected output value and the actual output value at each moment are input into a cost function to obtain a difference value, and when the difference value corresponding to each of the plurality of moments is smaller than the preset value, the weight update amount is judged to be smaller than the preset value.
In an embodiment of the invention, observed quantities of the self spacecraft and the target spacecraft at a plurality of moments and speed increment of the self spacecraft corresponding to the plurality of moments are obtained;
determining an instant reward function of the own spacecraft at the corresponding moment according to the distance between the own spacecraft and the target spacecraft at each moment in the plurality of moments, the solar azimuth angle and the azimuth angle between the target spacecraft and the own spacecraft;
training and adjusting an initial action network and an initial evaluation network of the self spacecraft in the simulation environment according to observed quantity, speed increment, instant rewarding function and MADDPG algorithm corresponding to each moment to obtain the action network and the evaluation network with converged training effect, wherein the weight updating quantity of the action network and the evaluation network with converged training effect is smaller than a preset value, and the method further comprises the following steps:
performing clipping and graded design of the instant reward function at each corresponding moment: obtaining the distance between the own spacecraft and the target spacecraft from the observed quantities of the own spacecraft and the target spacecraft at time t0+T; when the distance exceeds the maximum distance D, ending the episode and giving the own spacecraft a constant penalty -r1; when the distance is not greater than the maximum distance D, setting a distance grading-reward turning point L1 and setting the parameter values of the distance reward term R1 according to the turning point L1;
setting a direction grading-reward turning point L2 and setting the parameter values of the direction reward term R2 according to the turning point L2; in the global coordinate system, given the sun azimuth unit vector S = [sx, sy], calculating the azimuth vector A of the target spacecraft relative to the own spacecraft as A = [p_ax - p_bx, p_ay - p_by] / ‖[p_ax - p_bx, p_ay - p_by]‖,
where p_ax and p_ay denote the position of the target spacecraft in the x and y directions, p_bx and p_by denote the position of the own spacecraft in the x and y directions, sx is the component of the sun azimuth unit vector in the x direction and sy is its component in the y direction; the instant reward function of the own spacecraft is then determined as: R = R1 + R2.
For backlight-approach orbit control of a spacecraft, the invention proposes a graded design of the reward function of the reinforcement learning method. The proposed reward function takes both the relative distance and the target azimuth angle into account, with a grading point set for each: when the distance is smaller than its grading point the reward value increases rapidly as the distance decreases, and when the distance is larger than the grading point the reward value increases slowly as the distance decreases; when the angle between the azimuth and the sun direction is smaller than its grading point the reward value increases rapidly as the angle decreases, and when the angle is larger than the grading point the reward value increases slowly as the angle decreases. This graded reward design effectively improves the convergence speed, so that during strategy exploration the spacecraft converges more directedly toward regions of large reward value. A conventional linear reward function makes the spacecraft overly sensitive during strategy exploration, with obvious oscillation in the convergence process, while a sparse reward function taking only the values 0 or 1 leaves the spacecraft in an ineffective exploration stage for a long time, so that convergence is slow.
In an embodiment of the present invention, training and adjusting the initial action network and the initial evaluation network of the own spacecraft in the simulation environment by the network training module to obtain the action network and the evaluation network with converged training effect comprises:
training and adjusting the initial action network and the initial evaluation network of the own spacecraft in the simulation environment according to the observed quantities and speed increments of the own spacecraft at multiple moments, the instant reward function, the learning rate of the MADDPG algorithm, the long-term return discount factor, the number of training iterations per round and the batch-learning data quantity, to obtain the action network and the evaluation network whose weight update amount is smaller than a preset value; the specific steps are as follows:
step 1: initializing initial orbit parameters and initial positions of the self spacecraft and the target spacecraft, wherein the self spacecraft and the target spacecraft are positioned in the same orbit plane, and the initial distance is smaller than D;
step 2: initializing the initial action network and the initial evaluation network;
step 3: acquiring the observed quantity of the own spacecraft at time t1 in the gym simulation environment, performing normalization processing, inputting the processing result into the initial action network, and obtaining the speed increment of the own spacecraft at time t1;
step 4: performing orbit control on the own spacecraft in the gym simulation environment according to the speed increment at time t1;
step 5: calculating the instant reward value R at time t1+T according to the instant reward function of the own spacecraft;
step 6: generating a training sample from the calculation result and storing it in the sample library, wherein the training sample comprises: the observed quantities before and after orbit control, the control quantity and the instant reward (a condensed sketch of this sample-collection loop is given after step 10);
step 7: repeatedly executing the steps 3 to 6 at different moments, and acquiring training samples at different moments until the number of the training samples in the sample library reaches a preset value;
step 8: training the initial action network and the initial evaluation network according to the training sample;
step 9: repeating the steps 3 to 8 until the training of the current scene is completed;
step 10: and returning to the step 1, re-initializing the initial positions of the own spacecraft and the target spacecraft, and continuing training of the new scene until the weight updating amount of the action network and the evaluation network is smaller than a preset value.
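For reference, steps 3 to 6 above (collecting observation/control/reward samples from the gym environment into the sample library) can be condensed into the following Python sketch; the environment interface, the actor call, the exploration noise and all names are assumptions for illustration rather than the patent's exact implementation.

```python
import numpy as np
from collections import deque

replay_buffer = deque(maxlen=100_000)            # sample library

def collect_samples(env, actor, n_samples, dv_max=2e-4, explore_std=1e-5):
    """Roll out the current action network and fill the sample library with
    (obs, dv, reward, next_obs) tuples, mirroring steps 3 to 6."""
    obs = env.reset()                            # normalized observation at time t1
    while len(replay_buffer) < n_samples:
        dv = np.asarray(actor(obs), dtype=float) # speed increment from the action network
        dv = np.clip(dv + explore_std * np.random.randn(*dv.shape), -dv_max, dv_max)
        next_obs, reward, done, _ = env.step(dv) # orbit control, then instant reward at t1+T
        replay_buffer.append((obs, dv, reward, next_obs))
        obs = env.reset() if done else next_obs
    return replay_buffer
```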
The network training module randomly extracts n training samples from the sample library, calculates the instant reward value R corresponding to each observed quantity in each training sample, calculates the expected value of the long-term reward corresponding to each observed quantity according to the long-term return discount factor and the instant reward value R, takes this expected value as the desired output, and updates the weights of the initial evaluation network to obtain a first evaluation network;
it then determines the weight update amount of the initial action network according to the weights of the first evaluation network and the learning rate, updates the initial action network according to the weight update amount, and returns to the operation of randomly extracting n training samples from the sample library until training of the action network and the evaluation network is completed and the action network and the evaluation network with converged training effect are obtained.
Wherein, the observed quantity includes: position information and velocity information.
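The weight-update procedure just described (a random mini-batch from the sample library, a long-term-reward target for the evaluation network, then an action-network update driven by the evaluation network) follows the usual DDPG/MADDPG pattern. The PyTorch sketch below is an illustration under that assumption, with the learning rates carried by the optimizers and gamma as the long-term return discount factor; the network call signatures are assumptions.

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def update_networks(actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, buffer, n=64, gamma=0.99):
    batch = random.sample(buffer, n)             # n random training samples
    obs, act, rew, next_obs = (torch.as_tensor(np.array(x), dtype=torch.float32)
                               for x in zip(*batch))

    # Evaluation (critic) network: regress toward the long-term reward target.
    with torch.no_grad():
        target_q = rew.unsqueeze(-1) + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, act), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Action (actor) network: ascend the evaluation network's value of its own actions.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```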
Fig. 2 is a schematic structural diagram of a spacecraft backlighting approaching intelligent orbit control device according to an embodiment of the invention. As shown in fig. 2, the apparatus includes:
the system comprises a kinematic model establishment module 201, a speed increment acquisition module 202, an action network calculation module 203, an observed quantity calculation module 204 and a backlight observation judgment module 205.
The kinematic model establishing module 201 is configured to establish a kinematic model of a motion track of a spacecraft and a motion track of a target spacecraft in a simulation environment according to a kepler orbit dynamics method;
a speed increment obtaining module 202, configured to obtain observed amounts of the own spacecraft and the target spacecraft at time t0 and a speed increment of the target spacecraft at time t0 from the kinematic model;
The action network calculation module 203 is configured to input the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and to control the orbit of the own spacecraft according to the speed increment;
the observed quantity calculation module 204 is configured to input the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
the backlight observation judging module 205 is configured to judge, from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment at time t0, where the backlight observation range of the target spacecraft is determined according to the specific situation.
The apparatus further comprises: the network training module 206 is configured to obtain observed amounts of the own spacecraft and the target spacecraft at multiple moments, and speed increments of the own spacecraft corresponding to the multiple moments;
Determining an instant reward function of the own spacecraft at the corresponding moment according to the distance between the own spacecraft and the target spacecraft at each moment in the plurality of moments, the solar azimuth angle and the azimuth angle between the target spacecraft and the own spacecraft;
according to observed quantity, speed increment, instant rewarding function and MADDPG algorithm corresponding to each moment in the plurality of moments, training and adjusting an initial action network and an initial evaluation network of the spacecraft in the simulation environment to obtain the action network and the evaluation network with converged training effects, wherein the weight updating quantity of the action network and the evaluation network with converged training effects is smaller than a preset value.
Further comprises: an instant reward function design module 207, configured to perform clipping and graded design of the instant reward function at the corresponding moment, including:
obtaining the distance between the own spacecraft and the target spacecraft according to the observed quantity of the own spacecraft and the target spacecraft at the time t0+T;
when the distance exceeds the maximum distance D, ending the episode and giving the own spacecraft a constant penalty -r1;
when the distance is not greater than the maximum distance D, setting a distance grading-reward turning point L1 and setting the parameter values of the distance reward term R1 according to the turning point L1;
setting a direction grading-reward turning point L2 and setting the parameter values of the direction reward term R2 according to the turning point L2; in the global coordinate system, given the sun azimuth unit vector S = [sx, sy], calculating the azimuth vector A of the target spacecraft relative to the own spacecraft as A = [p_ax - p_bx, p_ay - p_by] / ‖[p_ax - p_bx, p_ay - p_by]‖,
where p_ax and p_ay denote the position of the target spacecraft in the x and y directions, p_bx and p_by denote the position of the own spacecraft in the x and y directions, sx is the component of the sun azimuth unit vector in the x direction and sy is its component in the y direction;
determining the instant reward function of the own spacecraft as: R = R1 + R2.
In an embodiment of the present invention, the network training module 206 training and adjusting the initial action network and the initial evaluation network of the own spacecraft in the simulation environment to obtain an action network and an evaluation network with converged training effect comprises:
training and adjusting the initial action network and the initial evaluation network of the own spacecraft in the simulation environment according to the observed quantities, speed increments and instant reward function at multiple moments and the learning rate of the MADDPG algorithm, the long-term return discount factor, the number of training iterations per round and the batch-learning data quantity, to obtain the action network and the evaluation network whose weight update amount is smaller than a preset value.
In one embodiment of the present invention, the network training module 206 is further configured to execute
Step 1: initializing initial orbit parameters and initial positions of the self spacecraft and the target spacecraft, wherein the self spacecraft and the target spacecraft are positioned in the same orbit plane, and the initial distance is smaller than D;
step 2: initializing the initial action network and the initial evaluation network;
step 3: acquiring the observed quantity of the own spacecraft at time t1 in the gym simulation environment, performing normalization processing, inputting the processing result into the initial action network, and obtaining the speed increment of the own spacecraft at time t1;
step 4: performing orbit control on the own spacecraft in the gym simulation environment according to the speed increment at time t1;
step 5: calculating the instant reward value R at time t1+T according to the instant reward function of the own spacecraft;
step 6: generating a training sample from the calculation result and storing it in the sample library, wherein the training sample comprises: the observed quantities before and after orbit control, the control quantity and the instant reward;
step 7: repeatedly executing the steps 3 to 6 at different moments, and acquiring training samples at different moments until the number of the training samples in the sample library reaches a preset value;
Step 8: training the initial action network and the initial evaluation network according to the training sample;
step 9: repeating the steps 3 to 8 until the training of the current scene is completed;
step 10: and returning to the step 1, re-initializing the initial positions of the own spacecraft and the target spacecraft, and continuing training of the new scene until the weight updating amount of the action network and the evaluation network is smaller than a preset value.
In an embodiment of the present invention, the network training module 206 is further configured to
Randomly extracting n training samples from the sample library;
calculating an instant rewarding value R corresponding to each observed quantity in each training sample;
calculating the expected value of the long-term reward corresponding to each observed quantity according to the long-term return discount factor and the instant reward value R, taking this expected value as the desired output, and updating the weights of the initial evaluation network to obtain a first evaluation network;
and determining the weight updating quantity of the initial action network according to the weight of the first evaluation network and the learning rate, updating the initial action network according to the weight updating quantity, returning to the operation of randomly extracting n training samples from the sample library until training of the action network and the evaluation network is completed to obtain the action network and the evaluation network with convergent training effects, and completing training.
The invention provides a spacecraft backlighting intelligent orbit control method, which comprises the following specific steps:
(1) In a two-dimensional plane, kinematic models of the spacecraft and the target are built based on Kepler orbit dynamics, the scene is built in gym, and visualization is realized;
(2) The observed quantity and the control quantity are designed for the spacecraft: the observed quantity of the spacecraft comprises its own position and velocity and the target's position and velocity, and the control quantity is the speed increment in the 2-dimensional plane;
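As a concrete illustration of (2), the observed quantity and the control quantity can be laid out as flat vectors; the ordering and dimensions below are assumptions for illustration only.

```python
import numpy as np

def make_observation(p_own, v_own, p_target, v_target):
    """8-dimensional observation: own position and velocity plus target position
    and velocity in the 2-D plane (ordering is illustrative)."""
    return np.concatenate([p_own, v_own, p_target, v_target]).astype(np.float32)

def clip_action(dv, dv_max=2e-4):
    """Control quantity: a 2-D speed increment, bounded here by the maximum
    single-burn magnitude used later in this embodiment (0.0002 km/s)."""
    return np.clip(np.asarray(dv, dtype=float), -dv_max, dv_max)
```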
(3) In order to improve learning efficiency and avoid reward values that are too sparse to effectively guide the learning process, an instant reward function is designed for the spacecraft, and the reward function must be clipped and designed in a graded manner. The specific method is as follows:
(3-1) when the distance between the spacecraft and the target exceeds a certain maximum distance D, the training episode is considered finished and the spacecraft receives a constant penalty -r1;
(3-2) when the distance between the spacecraft and the target is smaller than D, a graded-reward turning point L1 is designed with respect to the distance; when the distance d between the spacecraft and the target satisfies L1 < d < D, the reward value is small and changes slowly with distance, and when the distance d satisfies d < L1, the reward value is larger and increases rapidly as the distance decreases. The specific expression is as follows:
Wherein p is the position of the spacecraft, the subscript a corresponds to the target spacecraft, the subscript b corresponds to the approaching spacecraft, the subscript x corresponds to the x direction, and the subscript y corresponds to the y direction;
In actual training, D = 150 km and L1 ≈ 75 km are chosen.
FIG. 3 is a graph of the relationship between distance and the reward function value during backlight approach; as shown in FIG. 3, it plots the change of R1 as x varies over the interval [0, 1]. It can be seen that the rate of change of R1 with x changes significantly around x = 0.5;
(3-3) when the distance between the spacecraft and the target is less than D, a graded-reward turning point L2 is designed with respect to the target direction angle. In the global coordinate system, given the sun azimuth unit vector S = [sx, sy], the azimuth vector A of the target relative to the spacecraft is calculated as A = [p_ax - p_bx, p_ay - p_by] / ‖[p_ax - p_bx, p_ay - p_by]‖. When the cosine of the angle between S and A is smaller than L2, the reward function is small and changes slowly; when the cosine of the angle between S and A is larger than L2, the reward function is larger and increases rapidly with the cosine. The specific expression is as follows:
In actual training, L2 = 0.75 is selected;
FIG. 4 is a graph of the relationship between azimuth angle and the reward function value during backlight approach; as shown in FIG. 4, it plots the change of R2 as x varies over the interval [-1, 1]. It can be seen that the rate of change of R2 with x changes significantly around x = 0.75, at which point the corresponding angle between S and A is smaller than about 40 degrees;
Finally, the instant reward function of the spacecraft is: R = R1 + R2;
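The exact expressions for R1 and R2 are given as formulas in the original patent and are not reproduced in this text; the Python sketch below only mimics the described behaviour (a turning point L1 for the distance term, a turning point L2 for the direction-cosine term, and the constant penalty -r1 when the distance exceeds D), using the values chosen in this embodiment. The piecewise functional forms, slopes and penalty magnitude are all assumptions.

```python
import numpy as np

D, L1, L2, R_PENALTY = 150.0, 75.0, 0.75, 10.0   # km, km, cosine threshold, penalty magnitude

def instant_reward(p_target, p_own, sun_dir):
    """Illustrative graded reward R = R1 + R2; returns (reward, episode_done)."""
    rel = np.asarray(p_target, dtype=float) - np.asarray(p_own, dtype=float)
    d = float(np.linalg.norm(rel))
    if d > D:                                     # out of range: constant penalty, episode ends
        return -R_PENALTY, True

    # R1: distance term, slow growth for L1 < d < D, fast growth for d < L1
    x = d / D
    x1 = L1 / D
    r1 = (1.0 - x) if x > x1 else (1.0 - x1) + 4.0 * (x1 - x)

    # R2: direction term, slow growth below the cosine turning point L2, fast growth above it
    cos_sa = float(np.dot(np.asarray(sun_dir, dtype=float), rel / d))
    r2 = 0.5 * (cos_sa + 1.0) if cos_sa < L2 else 0.5 * (L2 + 1.0) + 4.0 * (cos_sa - L2)

    return r1 + r2, False
```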
(4) Action and evaluation networks are designed for the spacecraft, and the hyperparameters of the DDPG algorithm are designed, including the learning rate, the long-term return discount factor, the number of training iterations per round and the batch-learning data quantity;
(5) The learning training is carried out according to the following steps:
(5-1) initializing the orbit parameters and initial positions of the spacecraft and the target, with the spacecraft and the target in the same orbit plane and the initial distance smaller than D;
(5-2) initializing an action network and an evaluation network of the approaching spacecraft;
(5-3) obtaining the observed quantity of the spacecraft from the gym simulation environment, performing normalization processing, and inputting it into the spacecraft's action network to obtain the control quantity of the approaching spacecraft, namely the speed increment;
(5-4) executing the speed increment in the gym simulation environment so that the spacecraft performs orbit control;
(5-5) obtaining the observed quantities of the spacecraft and the target again from the gym simulation environment, and calculating the instant reward R according to the formulas in step (3);
(5-6) generating a training sample to be placed in a sample library, wherein the training sample comprises the last observed quantity, the control quantity, the instant rewards and the next observed quantity;
(5-7) repeating the steps (5-3) to (5-6) until the number of samples in the sample library reaches a certain value;
(5-8) starting training the action network and the evaluation network of the spacecraft:
i. firstly, randomly extracting n samples from a sample library;
calculating a long-term rewarding value corresponding to the observed quantity in each sample according to the data stored in the sample;
training an evaluation network of the spacecraft by taking the long-term rewarding value as a desired output;
iv, updating the action network weight with the evaluation network weight;
(5-9) continuously repeating steps (5-3) to (5-8) until the number of training iterations for the current scene is finished;
(5-10) returning to step (5-1), re-initializing the initial positions of the spacecraft and the target, and continuing training with the new scene until the networks converge.
(6) After training is completed, a new scene is initialized again at random, and the training effect of the backlight approaching algorithm is verified.
The specific parameters of approaching the spacecraft and the target are as follows:
initial orbital elements (six): [6378+440,0.00001,0, pi,0];
control period: 2 seconds;
maximum speed increment: 0.0002km/s;
task permission time: 2000sec;
action network structure: 2 hidden layers, 64 nodes in each layer;
evaluating the network structure: 2 hidden layers, 64 nodes in each layer;
network middle layer activation function: reLU;
network output layer activation function: reLU;
The expression of the ReLU function is:
y(x)=max(0,x)+min(0,x)
where x is the input of the output node, y is the output of the output node, max (0, x) is the larger value of 0 and x, and min (0, x) is the smaller value of 0 and x.
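For reference, a PyTorch sketch of an action network with the stated structure (two hidden layers of 64 nodes, ReLU in the hidden layers) is given below. Since max(0, x) + min(0, x) reduces to x, the output layer is left linear; bounding the commanded speed increment to the stated 0.0002 km/s is done here by clipping, which is an assumption rather than part of the patent text, and the observation dimension is likewise assumed.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Action network: 2 hidden layers of 64 nodes with ReLU, linear output."""
    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, act_dim),              # max(0, x) + min(0, x) = x, i.e. identity
        )

    def forward(self, obs):
        return self.net(obs)

# Usage sketch: bound the commanded speed increment to 0.0002 km/s per axis.
actor = ActionNetwork()
dv = torch.clamp(actor(torch.zeros(8)), -2e-4, 2e-4)
```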
FIG. 5 compares the convergence curves of the cumulative reward values for different learning parameters. FIG. 6 is a schematic diagram of the trajectory of the spacecraft approaching from the backlight direction; as shown in FIG. 6, it gives the trajectory of the spacecraft driven by the trained model, and it can be seen that, after training with the method provided by the invention, the obtained model can approach the target spacecraft from the backlight direction.
Compared with the prior art, the invention has the advantages that:
(1) The invention provides a reinforcement-learning intelligent orbit control method for backlight approach of a spacecraft. It is the first to apply deep reinforcement learning to learn and train an orbit control strategy for spacecraft backlight approach, and the first to give the general steps of reinforcement learning for the backlight-approach task together with a graded reward function design method. Unlike traditional orbit control methods, the method generates the orbit-change strategy directly from the target observed quantity and, through data feature extraction, builds models of single or multiple target spacecraft and single or multiple neighboring spacecraft in the environment, including the orbit model of the target and the model of the relative relation between the spacecraft and the target, so it adapts better to orbital maneuvers of the target and to changes in the number of targets in a dynamically changing space environment. At present, no publicly reported document, patent or news describes the use of deep reinforcement learning for backlight-approach orbit control of a spacecraft. The spacecraft learns and trains on observation data of the environment, which in practice can be obtained by communication or measurement means. In addition, most of the learning and training process can be completed in the established digital simulation environment without large-scale actual on-orbit training, so the method has engineering practicability;
An embodiment of the present invention provides a nonvolatile storage medium including: a software program which when executed performs the method described in fig. 1.
Although the present invention has been described in terms of the preferred embodiments, it is not intended to be limited to the embodiments, and any person skilled in the art can make any possible variations and modifications to the technical solution of the present invention by using the methods and technical matters disclosed above without departing from the spirit and scope of the present invention, so any simple modifications, equivalent variations and modifications to the embodiments described above according to the technical matters of the present invention are within the scope of the technical matters of the present invention.

Claims (9)

1. The method for controlling the backlighting approaching intelligent orbit of the spacecraft is characterized by comprising the following steps of:
establishing a kinematic model of the motion trail of the spacecraft and the motion trail of the target spacecraft in a simulation environment according to a kepler orbit dynamics method;
obtaining observed quantity of the own spacecraft and the target spacecraft at the time t0 and speed increment of the target spacecraft at the time t0 from the kinematic model, wherein the observed quantity comprises the following components: position information and speed information;
inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment;
inputting the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
judging, from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment at time t0, wherein the backlight observation range of the target spacecraft comprises: the own spacecraft being located between the target spacecraft and the sun, and the azimuth angle between the target spacecraft and the own spacecraft being smaller than a preset value;
obtaining observed quantities of the self spacecraft and the target spacecraft at a plurality of moments and speed increment of the self spacecraft at a plurality of corresponding moments;
Determining an instant reward function of the own spacecraft at the corresponding moment according to the distance between the own spacecraft and the target spacecraft at each moment in the plurality of moments, the solar azimuth angle and the azimuth angle between the target spacecraft and the own spacecraft;
training and adjusting an initial action network and an initial evaluation network of the self spacecraft in the simulation environment according to observed quantity, speed increment, instant rewarding function and MADDPG algorithm corresponding to each moment to obtain the action network and the evaluation network with converged training effect, wherein the weight updating quantity of the action network and the evaluation network with converged training effect is smaller than a preset value;
the method comprises the steps of obtaining the distance between a self spacecraft and a target spacecraft according to observed quantity of the self spacecraft and the target spacecraft at the time t0+T;
when the distance exceeds the maximum distance D, ending the episode and giving the own spacecraft a constant penalty -r1;
when the distance is not greater than the maximum distance D, setting a distance grading-reward turning point L1 and setting the parameter values of the distance reward term R1 according to the turning point L1;
setting a target-direction-angle grading-reward turning point L2 and setting the parameter values of the direction reward term R2 according to the turning point L2; in the global coordinate system, given the sun azimuth unit vector S = [sx, sy], calculating the azimuth vector A of the target spacecraft relative to the own spacecraft as A = [p_ax - p_bx, p_ay - p_by] / ‖[p_ax - p_bx, p_ay - p_by]‖,
where p_ax and p_ay denote the position of the target spacecraft in the x and y directions, p_bx and p_by denote the position of the own spacecraft in the x and y directions, sx is the component of the sun azimuth unit vector in the x direction and sy is its component in the y direction;
determining the instant reward function of the own spacecraft as: R = R1 + R2.
2. The method of claim 1, wherein training the initial action network and the initial evaluation network of the own spacecraft in the simulation environment to obtain an action network and an evaluation network with converged simulation effect comprises:
and training and adjusting the initial action network and the initial evaluation network of the own spacecraft in the simulation environment according to the observed quantities, speed increments and instant reward function at multiple moments and the learning rate of the MADDPG algorithm, the long-term return discount factor, the number of training iterations per round and the batch-learning data quantity, to obtain the action network and the evaluation network whose weight update amount is smaller than a preset value.
3. The method of claim 2, wherein training the initial action network and the initial evaluation network of the own spacecraft in the simulation environment to obtain the action network and the evaluation network with the weight update amount smaller than the preset value comprises:
Step 1: initializing initial orbit parameters and initial positions of the self spacecraft and the target spacecraft, wherein the self spacecraft and the target spacecraft are positioned in the same orbit plane, and the initial distance is smaller than D;
step 2: initializing the initial action network and the initial evaluation network;
step 3: acquiring the observed quantity of the own spacecraft at time t1 in the gym simulation environment, performing normalization processing, inputting the processing result into the initial action network, and obtaining the speed increment of the own spacecraft at time t1;
step 4: performing orbit control on the own spacecraft in the gym simulation environment according to the speed increment at time t1;
step 5: calculating the instant reward value R at time t1+T according to the instant reward function of the own spacecraft;
step 6: generating a training sample from the calculation result and storing it in the sample library, wherein the training sample comprises: the observed quantities before and after orbit control, the control quantity and the instant reward;
step 7: repeatedly executing the steps 3 to 6 at different moments, and acquiring training samples at different moments until the number of the training samples in the sample library reaches a preset value;
step 8: training the initial action network and the initial evaluation network according to the training sample;
Step 9: repeating the steps 3 to 8 until the training of the current scene is completed;
step 10: and returning to the step 1, re-initializing the initial positions of the own spacecraft and the target spacecraft, and continuing training of the new scene until the weight updating amount of the action network and the evaluation network is smaller than a preset value.
4. A method according to claim 3, wherein training the initial action network and the initial evaluation network according to the training samples in step 8 comprises:
randomly extracting n training samples from the sample library;
calculating an instant rewarding value R corresponding to each observed quantity in each training sample;
calculating the expected value of the long-term reward corresponding to each observed quantity according to the long-term return discount factor and the instant reward value R, taking this expected value as the desired output, and updating the weights of the initial evaluation network to obtain a first evaluation network;
and determining the weight updating quantity of the initial action network according to the weight of the first evaluation network and the learning rate, updating the initial action network according to the weight updating quantity, returning to the operation of randomly extracting n training samples from the sample library until training of the action network and the evaluation network is completed to obtain the action network and the evaluation network with convergent training effects, and completing training.
5. An intelligent orbit control device for backlighting of a spacecraft, which is characterized by comprising:
the kinematic model building module is used for building a kinematic model of the motion trail of the spacecraft and the motion trail of the target spacecraft in a simulation environment according to a kepler orbit dynamics method;
the speed increment obtaining module is used for obtaining observed quantity of the own spacecraft and the target spacecraft at the time t0 and speed increment of the target spacecraft at the time t0 from the kinematic model, wherein the observed quantity comprises the following components: position information and speed information;
the action network calculation module is used for inputting the observed quantity of the own spacecraft at time t0 into an action network whose training effect has converged to calculate the speed increment of the own spacecraft at time t0, and controlling the orbit of the own spacecraft according to the speed increment;
the observed quantity calculation module is used for inputting the observed quantities and speed increments of the own spacecraft and the target spacecraft at time t0 into the respective kinematic models to calculate the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
the backlight observation judging module is used for judging, from the observed quantities of the own spacecraft and the target spacecraft at time t0+T, the solar azimuth angle at time t0+T and the azimuth angle between the target spacecraft and the own spacecraft at time t0+T, whether the own spacecraft is within the backlight observation range of the target spacecraft after orbit control according to the speed increment at time t0, wherein the backlight observation range of the target spacecraft comprises: the own spacecraft being located between the target spacecraft and the sun, and the azimuth angle between the own spacecraft and the target spacecraft being smaller than a preset value;
Further comprises: the network training module is used for acquiring observed quantities of the self spacecraft and the target spacecraft at a plurality of moments and speed increment of the self spacecraft at a plurality of corresponding moments;
determining an instant reward function of the own spacecraft at the corresponding moment according to the distance between the own spacecraft and the target spacecraft at each moment in the plurality of moments, the solar azimuth angle and the azimuth angle between the target spacecraft and the own spacecraft;
training and adjusting an initial action network and an initial evaluation network of the self spacecraft in the simulation environment according to observed quantity, speed increment, instant rewarding function and MADDPG algorithm corresponding to each moment to obtain the action network and the evaluation network with converged training effect, wherein the weight updating quantity of the action network and the evaluation network with converged training effect is smaller than a preset value;
further comprises: an instant reward function design module, which is used for:
obtaining the distance between the own spacecraft and the target spacecraft according to the observed quantities of the own spacecraft and the target spacecraft at time t0+T;
when the distance exceeds the maximum distance D, ending the process and giving the own spacecraft a constant penalty of -r1;
when the distance does not exceed the maximum distance D, setting a distance graded-reward turning point L1 and setting parameter values according to the graded-reward turning point L1;
setting a target direction angle graded-reward turning point L2 and setting parameter values according to the target direction angle graded-reward turning point L2, and, in the global coordinate system, calculating the azimuth vector of the target spacecraft relative to the own spacecraft according to the given sun azimuth unit vector S = [s_x, s_y],
wherein p_ax and p_ay respectively denote the position information of the target spacecraft in the x and y directions, p_bx and p_by respectively denote the position information of the own spacecraft in the x and y directions, s_x is the component of the sun azimuth unit vector in the x direction, and s_y is the component of the sun azimuth unit vector in the y direction;
determining the instant reward function of the own spacecraft as R = R1 + R2;
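As a reading aid only (not part of the claims), the graded instant reward R = R1 + R2 and the backlight-range test described above can be sketched in Python roughly as follows. The function name, the reward magnitudes, and the use of the angle between the sun direction and the target-to-own-spacecraft line of sight are illustrative assumptions; the claim fixes only the constant penalty -r1, the maximum distance D and the turning points L1 and L2:

```python
import numpy as np

def instant_reward(p_a, p_b, sun_dir, D, L1, L2,
                   r1=1.0, r_near=1.0, r_far=0.1, r_ang_good=1.0, r_ang_bad=0.1):
    """Illustrative graded reward R = R1 + R2 for the own spacecraft.

    p_a, p_b : 2-D positions of the target and the own spacecraft (global frame)
    sun_dir  : sun azimuth unit vector S = [s_x, s_y]
    D        : maximum allowed separation; exceeding it ends the episode
    L1, L2   : graded-reward turning points for distance and direction angle
    """
    rel = np.asarray(p_a, dtype=float) - np.asarray(p_b, dtype=float)
    dist = np.linalg.norm(rel)
    if dist > D:                       # out of range: constant penalty, episode ends
        return -r1, True

    # R1: distance-graded term, higher reward once inside the turning point L1
    R1 = r_near if dist < L1 else r_far

    # R2: direction-angle term. The own spacecraft is backlit when it lies
    # between the target and the sun, i.e. the target-to-own direction roughly
    # matches the sun direction (assumed interpretation of the claim).
    target_to_own = -rel / dist
    cos_ang = np.clip(np.dot(target_to_own, np.asarray(sun_dir, dtype=float)), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_ang))
    R2 = r_ang_good if angle < L2 else r_ang_bad

    return R1 + R2, False
```

A caller would treat the second return value as the episode-termination flag used in the training loop.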
6. The apparatus of claim 5, wherein the network training module is further configured to:
train and adjust the initial action network and the initial evaluation network of the own spacecraft in the simulation environment according to the observed quantities, speed increments and instant reward functions at multiple moments, together with the learning rate of the MADDPG algorithm, the long-term return discount factor, the number of training iterations per round and the amount of data learned per batch, to obtain an action network and an evaluation network whose weight update amounts are smaller than a preset value.
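For concreteness, a hypothetical hyperparameter set covering the quantities listed in this claim (learning rates, discount factor, per-round training iterations, batch size) might look like the following; the names mirror the claim, but none of the numerical values comes from the patent:

```python
# Placeholder MADDPG training hyperparameters (illustrative values only).
MADDPG_CONFIG = {
    "actor_lr": 1e-4,                 # learning rate of the action network
    "critic_lr": 1e-3,                # learning rate of the evaluation network
    "gamma": 0.99,                    # long-term return discount factor
    "train_iters_per_round": 100,     # number of training iterations per round
    "batch_size": 256,                # amount of data learned per batch
    "buffer_size": 100_000,           # capacity of the training-sample library
    "weight_update_threshold": 1e-5,  # convergence test on the weight update amount
}
```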
7. The apparatus of claim 6, wherein the network training module is further configured to perform:
Step 1: initializing the initial orbit parameters and initial positions of the own spacecraft and the target spacecraft, wherein the own spacecraft and the target spacecraft are located in the same orbit plane and their initial distance is smaller than D;
Step 2: initializing the initial action network and the initial evaluation network;
Step 3: acquiring the observed quantity of the own spacecraft at time t1 in a gym simulation environment, carrying out normalization processing, inputting the processing result into the initial action network, and obtaining the speed increment of the own spacecraft at time t1;
Step 4: performing orbit control on the own spacecraft in the gym simulation environment according to the speed increment at time t1;
Step 5: calculating the instant reward value R at time t1+T according to the instant reward function of the own spacecraft;
Step 6: generating a training sample from the calculation result and storing it in a sample library, wherein the training sample comprises: the observed quantities before and after orbit control, the control quantity and the instant reward;
Step 7: repeatedly executing Steps 3 to 6 at different moments to acquire training samples at different moments, until the number of training samples in the sample library reaches a preset value;
Step 8: training the initial action network and the initial evaluation network according to the training samples;
Step 9: repeating Steps 3 to 8 until the training of the current scene is completed;
Step 10: returning to Step 1, re-initializing the initial positions of the own spacecraft and the target spacecraft, and continuing training on new scenes until the weight update amounts of the action network and the evaluation network are smaller than a preset value.
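The ten steps above map naturally onto a scene-by-scene training loop. The sketch below assumes a gym-style environment that already simulates both spacecraft and an agent object bundling the action and evaluation networks (initialized before the loop, per Step 2); env, agent, buffer and their method names are placeholders, not interfaces defined in the patent:

```python
def train(env, agent, buffer, n_scenes, steps_per_scene, min_samples, batch_size, eps):
    """Skeleton of the training procedure of Steps 1-10 (illustrative only)."""
    for _ in range(n_scenes):
        obs = env.reset()                                # Step 1: re-initialize orbits and positions
        for _ in range(steps_per_scene):
            dv = agent.act(agent.normalize(obs))         # Step 3: normalized observation -> speed increment
            next_obs, reward, done, _ = env.step(dv)     # Steps 4-5: orbit control, instant reward at t1+T
            buffer.add(obs, dv, reward, next_obs)        # Step 6: store the training sample
            obs = next_obs
            if len(buffer) >= min_samples:               # Step 7: enough samples collected
                agent.update(buffer.sample(batch_size))  # Step 8: train action and evaluation networks
            if done:                                     # e.g. separation exceeded D
                break
        if agent.last_weight_update() < eps:             # Step 10: stop once updates are tiny
            break
    return agent
```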
8. The apparatus of claim 7, wherein the network training module is further configured to:
randomly extract n training samples from the sample library;
calculate the instant reward value R corresponding to each observed quantity in each training sample;
calculate the expected value of the long-term reward corresponding to each observed quantity according to the long-term return discount factor and the instant reward value R, take the expected values as the expected outputs, and update the weights of the initial evaluation network to obtain a first evaluation network;
and determine the weight update amount of the initial action network according to the weights of the first evaluation network and the learning rate, update the initial action network according to the weight update amount, and return to the operation of randomly extracting n training samples from the sample library until the training of the action network and the evaluation network is completed, so as to obtain the action network and the evaluation network with converged training effect.
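As an illustration of this batch update, the following PyTorch-style sketch performs one DDPG-like step: the evaluation (critic) network is regressed toward the discounted long-term reward, and the action (actor) network is then adjusted using the updated critic and the chosen learning rate. The use of separate target networks and the exact loss forms are assumptions, not details given in the claim:

```python
import torch
import torch.nn.functional as F

def update_step(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma):
    """One illustrative actor-critic update on a batch of n training samples."""
    obs, act, rew, next_obs = batch      # tensors drawn from the sample library

    # Expected long-term reward used as the evaluation network's training target:
    # instant reward plus the discounted value of the follow-on state.
    with torch.no_grad():
        next_q = target_critic(next_obs, target_actor(next_obs))
        q_target = rew + gamma * next_q

    critic_loss = F.mse_loss(critic(obs, act), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # The action network's weight update follows the freshly updated critic:
    # ascend the critic's value of the actor's own actions.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```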
9. A non-volatile storage medium, comprising: a software program which, when executed, performs the method according to any one of claims 1 to 4.
CN202110450164.4A 2021-04-25 2021-04-25 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium Active CN113325704B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110450164.4A CN113325704B (en) 2021-04-25 2021-04-25 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110450164.4A CN113325704B (en) 2021-04-25 2021-04-25 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113325704A CN113325704A (en) 2021-08-31
CN113325704B true CN113325704B (en) 2023-11-10

Family

ID=77413672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110450164.4A Active CN113325704B (en) 2021-04-25 2021-04-25 Spacecraft backlighting approaching intelligent orbit control method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113325704B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650571A (en) * 2009-09-04 2010-02-17 中国科学院上海技术物理研究所 Device and method for catching and tracking counterglow counter light target
CN108994818A (en) * 2017-06-07 2018-12-14 发那科株式会社 control device and machine learning device
CN110850719A (en) * 2019-11-26 2020-02-28 北京航空航天大学 Spatial non-cooperative target parameter self-tuning tracking method based on reinforcement learning
CN111667513A (en) * 2020-06-01 2020-09-15 西北工业大学 Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN111679592A (en) * 2020-06-22 2020-09-18 中国人民解放军国防科技大学 Spacecraft pursuit and escape game closed-loop semi-physical simulation system and method
CN111692921A (en) * 2020-06-12 2020-09-22 中山大学 Anti-reconnaissance interference method based on sunlight reflection
CN112595313A (en) * 2020-11-25 2021-04-02 北京海达星宇导航技术有限公司 Vehicle-mounted navigation method and device based on machine learning and computer equipment
CN112591146A (en) * 2020-12-21 2021-04-02 中国人民解放军63921部队 Observation method and system for high-orbit target minute-level rapid traversal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11613249B2 (en) * 2018-04-03 2023-03-28 Ford Global Technologies, Llc Automatic navigation using deep reinforcement learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Real-Time Optimal Control for Spacecraft Orbit Transfer via Multiscale Deep Neural Networks; LIN CHENG et al.; IEEE Transactions on Aerospace and Electronic Systems; 2019-10-31; Vol. 55, No. 5; 2436-2450 *
Infinite-horizon spacecraft pursuit-evasion strategy solving based on deep neural networks; WU Qichang et al.; Aerospace Control (航天控制); 2019-12-31; Vol. 37, No. 6; 13-18, 58 *

Also Published As

Publication number Publication date
CN113325704A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN107102644B (en) Underwater robot track control method and control system based on deep reinforcement learning
CN109343341B (en) Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN107748566B (en) Underwater autonomous robot fixed depth control method based on reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN109483530B (en) Foot type robot motion control method and system based on deep reinforcement learning
CN111538241B (en) Intelligent control method for horizontal track of stratospheric airship
CN108645413A (en) The dynamic correcting method of positioning and map building while a kind of mobile robot
CN111351488A (en) Intelligent trajectory reconstruction reentry guidance method for aircraft
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN105894090B (en) A kind of tide intelligence Real-time Forecasting Method based on TSP question particle group optimizing
CN113377121B (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN108267952A (en) A kind of adaptive finite-time control method of underwater robot
CN109605377A (en) A kind of joint of robot motion control method and system based on intensified learning
CN113325704B (en) Spacecraft backlighting approaching intelligent orbit control method, device and storage medium
Xu et al. Artificial moment method for swarm robot formation control
CN113311851B (en) Spacecraft chase-escaping intelligent orbit control method, device and storage medium
Liu et al. Navigation algorithm based on PSO-BP UKF of autonomous underwater vehicle
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
Song et al. Search and tracking strategy of autonomous surface underwater vehicle in oceanic eddies based on deep reinforcement learning
CN113268859B (en) Simulation method, system and storage medium for spacecraft on-orbit game
CN110239744A (en) Thrust Trajectory Tracking Control method is determined in a kind of landing of small feature loss
CN115657689A (en) Autonomous underwater vehicle target tracking control method based on track prediction
El-Fakdi et al. Autonomous underwater vehicle control using reinforcement learning policy search methods
CN113353289B (en) Autonomous driving and separating method and device for space game and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yuan Li

Inventor after: Huang Huang

Inventor after: Han Dong

Inventor after: Shi Heng

Inventor after: Wei Chunling

Inventor before: Yuan Li

Inventor before: Huang Huang

Inventor before: Han Dong

Inventor before: Shi Heng

Inventor before: Wei Chunling

Inventor before: Li Xiang

GR01 Patent grant