CN110991712B - Planning method and device for space debris removal task - Google Patents


Info

Publication number
CN110991712B
Authority
CN
China
Prior art keywords
state quantity
action
space debris
generating
sequence
Prior art date
Legal status
Active
Application number
CN201911146850.1A
Other languages
Chinese (zh)
Other versions
CN110991712A (en)
Inventor
杨家男
侯晓磊
冯乾
苏笑宇
刘勇
潘泉
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201911146850.1A priority Critical patent/CN110991712B/en
Publication of CN110991712A publication Critical patent/CN110991712A/en
Application granted granted Critical
Publication of CN110991712B publication Critical patent/CN110991712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a planning method and device for a space debris removal task. Space debris information to be cleared and spacecraft state information are acquired; a reinforcement learning search tree model is constructed from the space debris information to be cleared and the spacecraft state information, the model comprising state quantities, actions and benefit values; actions are generated with an upper confidence bound tree (UCT) search method starting from the initial state of the reinforcement learning search tree model; the next state quantity is generated from the state quantity and the action, and when the generated state quantity is a termination state quantity, a space debris clearing sequence and its corresponding benefit value are generated; the sequence-generating step is repeated until the number of clearing sequences reaches a first preset number, and the clearing sequence with the largest benefit value is selected as the optimal clearing sequence of the space debris to be cleared. The method and device can reduce the time consumed in generating the optimal clearing sequence and improve the energy utilization of the spacecraft.

Description

Planning method and device for space debris removal task
[ field of technology ]
The invention belongs to the technical field of space debris removal, and particularly relates to a planning method and device for a space debris removal task.
[ background Art ]
Space debris has become a significant obstacle and threat to human aerospace activity and to spacecraft in orbit. Low-Earth-orbit space debris is densely distributed and highly hazardous. Owing to the Kessler effect, collisions between fragments and orbital perturbations lead to a sharp increase in the number of space debris objects and a continuously expanding distribution range. The latest space debris environment prediction report shows that by 2014 the number of low-orbit space debris objects had already far exceeded the predicted value, so the development of low-orbit space debris removal technology brooks no delay.
Active Debris Removal (ADR) is a technology that cleans up low-orbit space debris by capturing objects one by one. How to design the optimal task sequence and the multiple rendezvous trajectories of an ADR spacecraft is the primary problem faced by current multi-debris active removal. At present, common ADR planning methods include genetic-algorithm-based planning and the Deep Q-Learning (DQN) algorithm described in On the Application of Reinforcement Learning in Multi-debris Active Removal Mission Planning; however, these planning methods take a long time to search for the optimal removal sequence, have low precision, and easily fall into locally optimal solutions, which reduces the energy utilization of the spacecraft.
[ invention ]
The invention aims to provide a planning method and device for a space debris removal task, so as to reduce the time consumed in generating the optimal removal sequence and improve the energy utilization of the spacecraft.
The invention adopts the following technical scheme: a method of planning a space debris removal task, comprising:
acquiring space debris information to be cleaned and spacecraft state information;
constructing a reinforcement learning search tree model according to the space debris information to be cleaned and the spacecraft state information; wherein the state quantity, action and benefit values are included in the reinforcement learning search tree model;
generating a sequence: generating actions by adopting an upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; generating a next state quantity according to the state quantity and the action, and generating a space debris clearing sequence and a corresponding benefit value when the generated state quantity is a termination state quantity;
in the expansion process of the UCT search method, selecting actions randomly or through a pre-built neural network model; in the simulation process of the UCT search method, selecting random simulation as the simulation mode;
and repeatedly executing the sequence-generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the optimal clearing sequence of the space debris to be cleared.
Further, in the generating sequence step, when the generated state quantity is a non-termination state quantity, the following steps are repeatedly performed:
generating actions by adopting a first strategy in the upper confidence bound tree (UCT) search method according to the current state quantity;
generating a next state quantity according to the current state quantity and the generated action;
and judging whether the next state quantity is a termination state quantity.
Further, generating the next state quantity according to the current state quantity and the generated action includes:
acquiring a current state quantity;
when the current state quantity is a non-termination state quantity and meets a first preset condition, selecting actions from a pre-constructed action library according to a first strategy;
and generating a next state quantity according to the selected action and the current state quantity.
Further, when the current state quantity is a non-termination state quantity and the first preset condition is not satisfied:
generating a second strategy by using the current state quantity as input information and adopting the expansion and simulation processes of the upper confidence bound tree (UCT) search method;
and selecting actions from a pre-constructed action library according to a second strategy.
Further, generating the second strategy by adopting the expansion and simulation processes of the upper confidence bound tree (UCT) search method comprises the following steps:
Selecting actions from a pre-constructed action library, randomly or through the constructed neural network model, according to the current state quantity;
generating a next state quantity according to the action and the current state quantity;
randomly selecting actions from a pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of space fragments to be cleared;
updating a benefit value for each spatial fragment in the purge sequence;
when the number of purge sequences reaches a second predetermined number, a second policy is generated based on the benefit value for each spatial fragment.
Further, when the number of purge sequences does not reach the second predetermined number, repeating the steps of:
selecting actions from a pre-constructed action library, randomly or through the constructed neural network model, according to the current state quantity;
generating a next state quantity according to the action and the current state quantity;
randomly selecting actions from a pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of space fragments to be cleared;
the benefit value for each spatial fragment in the purge sequence is updated.
Further,
the state quantity comprises the number of remaining space debris to be cleared, the remaining energy of the spacecraft, the remaining time of the current space debris clearing task, the number of the next space debris to be cleared, and a binary representation of the states of all space debris;
The actions are actions performed by a spacecraft going from one space debris to another;
the benefit value is a scoring value obtained after taking an action with respect to a state quantity.
Another technical scheme of the invention is as follows: a planning apparatus for a space debris removal task, comprising:
the acquisition module is used for acquiring space debris information to be cleaned and spacecraft state information;
the construction module is used for constructing a reinforcement learning search tree model according to the space debris information to be cleaned and the spacecraft state information; wherein the state quantity, action and benefit values are included in the reinforcement learning search tree model;
a generating module, configured to generate a sequence: generating actions by adopting an upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; generating a next state quantity according to the state quantity and the action, and generating a space debris clearing sequence and a corresponding benefit value when the generated state quantity is a termination state quantity;
in the expansion process of the UCT search method, selecting actions randomly or through a pre-built neural network model; in the simulation process of the UCT search method, selecting random simulation as the simulation mode;
And the selection module is used for repeatedly executing the sequence-generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the clearing sequence of the space debris to be cleared.
Further, in the generating sequence step, when the generated state quantity is a non-termination state quantity, the generating sequence further includes the following modules for repeatedly executing the following steps:
the first action generating module is used for generating actions by adopting a first strategy in the upper confidence bound tree (UCT) search method according to the current state quantity;
the first state quantity generation module is used for generating a next state quantity according to the current state quantity and the generated action;
and the judging module is used for judging whether the next state quantity is a termination state quantity.
Another technical scheme of the invention is as follows: a planning device for a space debris removal task comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the planning method for the space debris removal task when executing the computer program.
The beneficial effects of the invention are as follows: according to the method, a reinforcement learning search tree model is built from the space debris information to be cleared and the spacecraft information, and the space debris clearing problem is converted into state quantities, actions and benefit values; actions are generated through an upper confidence bound tree (UCT) search method, specifically, actions are selected randomly or through a pre-built neural network model in the expansion process of the method, and random simulation is selected as the simulation mode of the simulation process. This reduces the number of searches needed to generate the optimal clearing sequence, saves generation time of the optimal clearing sequence, further improves the energy utilization of the spacecraft, and increases the rationality of space debris clearing task planning.
[ description of the drawings ]
FIG. 1 is a flowchart of a planning method for a space debris removal task according to an embodiment of the present application;
FIG. 2 is a flow chart of a selection action according to a current state quantity in an embodiment of the present application;
FIG. 3 is a schematic diagram of a search tree structure of a space debris removal task according to an embodiment of the present application;
FIG. 4 is a block diagram of a neural network used in an embodiment of the present application;
FIG. 5 is a graph of neural network loss trained in an embodiment of the present application;
fig. 6 is a graph of the convergence effect of the neural network trained in the embodiment of the present application.
[ detailed description ] of the invention
The invention will be described in detail below with reference to the drawings and the detailed description.
The space debris removal problem is first described and formulated to fit the needs of the method of the embodiments of the present application.
In the embodiment of the application, the orbit transfer strategy of the spacecraft uses a drift orbit transfer strategy of pulse thrust, and the transfer energy consumption is calculated in advance through discrete time points.
Assume that there are N space debris objects waiting to be cleared, of which n (n ≤ N) are de-orbited in a particular order. The space debris solution takes the form d = (d(1), d(2), …, d(n))^T, where 1 ≤ d(i) ≤ N and d(i) is the number of the i-th cleared space debris. The clean-up time solution takes the form t = (t(1), t(2), …, t(n))^T, where T_0 ≤ t(1) < t(2) < … < t(n) < T_max, T_0 is the task start time and T_max is the task end time. The orbit information of all space debris can be propagated to time t(i).
For the time-varying system described above, the embodiments of the present application provide a maximum benefit model to determine the solutions d and t. The model maximizes the task benefit under the constraints of task time and transfer energy, specifically:
max_{d,t} G(d) = Σ_{i=1}^{n} G(d(i))    (1)
where G(i) > 0 is the clean-up benefit of the i-th space debris, and G(d) is the benefit of the overall task.
Under the task profit model, the task time and transfer energy constraint conditions also need to be satisfied:
C_Velocity = Σ_{i=1}^{n} C_v(i) ≤ ΔV_max    (2)

C_Duration = Σ_{i=1}^{n} C_d(i) ≤ T_max − T_0    (3)
wherein C_v is the per-stage transfer energy consumption (i.e. the energy consumed in transferring from the last cleared space debris to the next), C_Velocity is the transfer energy consumption of the whole clearing process, C_d is the per-stage transfer time consumption, and C_Duration is the transfer time consumption of the whole clearing process. ΔV_max is the total transfer pulse energy available to the orbit transfer vehicle (OTV). Notably, G(d) is at the designer's discretion and can be the length of the solution or the accumulated fragment RCS (Radar Cross Section) value.
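For illustration only, the following minimal Python sketch evaluates G(d) and checks the two constraints for a candidate solution; the helper names, the dictionary-based cost tables cv/cd and the toy numbers are assumptions made for the sketch, not data from the embodiments.

```python
import numpy as np

def total_benefit(d, g):
    """G(d): sum of the per-fragment clean-up benefits g(i) over the solution d."""
    return sum(g[i] for i in d)

def feasible(d, t, cv, cd, dv_max, t_max):
    """Check the transfer-energy and task-time constraints for the solution (d, t).

    cv[i][j] / cd[i][j] stand in for the pre-computed per-leg energy / time
    consumption between fragments i and j (a simplification of the
    time-dependent transfer matrices described in the text).
    """
    c_velocity = sum(cv[d[k]][d[k + 1]] for k in range(len(d) - 1))
    c_duration = sum(cd[d[k]][d[k + 1]] for k in range(len(d) - 1))
    return c_velocity <= dv_max and c_duration <= t_max and bool(np.all(np.diff(t) > 0))

# Toy usage with made-up numbers
g = {1: 1.0, 2: 1.0, 3: 1.0}
cv = {1: {2: 300.0, 3: 500.0}, 2: {3: 400.0}}
cd = {1: {2: 20.0, 3: 35.0}, 2: {3: 30.0}}
d, t = [1, 2, 3], np.array([0.0, 25.0, 60.0])
print(total_benefit(d, g), feasible(d, t, cv, cd, dv_max=1000.0, t_max=365.0))  # 3.0 True
```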
To solve the above problem, the embodiments of the present application adopt a reinforcement learning method. The general reinforcement learning framework consists of a state set S and an action set A. When the spacecraft is in state s and completes action a (in action set A), it propagates through the environment to a new state s' and obtains a benefit R_a(s, s'). The spacecraft acts appropriately in each state so as to maximize the cumulative benefit. The mapping from the current state s to action a is the policy π(s).
The state design of the embodiment of the application is as follows:
the state of reinforcement learning needs to fully express the information the planner needs at each step. Since the loss-function (transfer consumption) information is known, the state quantity design contains N+4 variables. The contents are shown in the following table.
TABLE 1 State vector
State number: 1 | 2 | 3 | 4 | 5–(N+4)
Element: Number of fragments left | ΔV_left | T_left | Current target fragment | Binary identifiers
Wherein state 1 is the number of remaining space debris to be cleared, state 2 is the remaining transfer pulse energy (i.e. the remaining energy) of the spacecraft, state 3 is the remaining time of the clearing task, state 4 is the number of the next target space debris to be cleared, and states 5 to N+4 indicate, with binary flag bits, the clearing status of all space debris (1 for cleared, 0 for not cleared).
For example, suppose there are 5 space debris objects to be cleared in the task and 2 have been cleared at the current time, their numbers being 1 and 3; the binary state is then 10100. All possible state vectors above constitute the state space S.
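As an illustration, a minimal Python sketch of assembling such an (N+4)-element state vector follows; the function and argument names, and the chosen current target, are assumptions made for the sketch.

```python
import numpy as np

def make_state(n_total, cleared, current_target, dv_left, t_left):
    """Build the (N+4)-element state vector of Table 1.

    cleared        -- fragment numbers (1..N) that have already been removed
    current_target -- number of the next fragment to be removed (assumed here)
    """
    flags = np.zeros(n_total)
    flags[[i - 1 for i in cleared]] = 1.0            # 1 = cleared, 0 = not cleared
    remaining = n_total - len(cleared)               # state 1: fragments left
    head = np.array([remaining, dv_left, t_left, current_target], dtype=float)
    return np.concatenate([head, flags])

# Example from the text: 5 fragments, fragments 1 and 3 already cleared -> flags 1 0 1 0 0
s = make_state(n_total=5, cleared=[1, 3], current_target=2, dv_left=800.0, t_left=300.0)
print(s)   # [  3. 800. 300.   2.   1.   0.   1.   0.   0.]
```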
The state vector (as in the table) can express not only all solutions in the solution space (i.e. the clearing sequences) but also the state information of the OTV (orbit transfer vehicle), and the state vector does not contain previously completed actions. The number of possible state vectors is approximately

N · ΔV_max · T_max · 2^N

If N = 320, ΔV_max = 3000 m/s and T_max = 365 days, this formula gives about 7.5×10^104 states.
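Assuming the state count is indeed the product N·ΔV_max·T_max·2^N (one state per unit of ΔV and per day, as in the reconstruction above), the figure can be checked directly:

```python
# Rough order-of-magnitude check of the state-space size (unit discretization assumed)
N, dv_max, t_max = 320, 3000, 365
num_states = N * dv_max * t_max * 2 ** N
print(f"{num_states:.1e}")   # about 7.5e+104
```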
The actions of the embodiment of the application are designed as follows:
in conventional reinforcement learning, actions are synchronized with time, since states are usually propagated through time. However, the state design presented in this embodiment is for the multi-fragment clearing task, whose state is determined by the number of clearing steps. All possible actions a are given by the following formula and constitute the set A.
a = [d, Δt]^T    (4)

where d is the target fragment to be cleared, belonging to the set {1, 2, …, N}, and Δt is the time taken to transfer to the target fragment, belonging to a pre-discretized set of admissible transfer durations.
In the present embodiment, it is assumed that the time at which the spacecraft reaches a target debris object equals the time at which it leaves the previous one, so this action construction, which involves no waiting process, is suitable for reinforcement learning.
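A small sketch of this <target fragment, transfer interval> action encoding follows; the discretization of the transfer intervals is an assumption made for the illustration.

```python
from itertools import product

def build_action_space(n_fragments, transfer_intervals):
    """Enumerate all actions a = [d, dt]^T as (target fragment, transfer interval) pairs."""
    return list(product(range(1, n_fragments + 1), transfer_intervals))

# Illustrative discretization: transfer intervals of 1..28 days
actions = build_action_space(n_fragments=10, transfer_intervals=range(1, 29))
print(len(actions), actions[0], actions[-1])   # 280 (1, 1) (10, 28)
```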
The state propagation design of the embodiment of the application is as follows:
the next state is given by the state propagation function (5):
s'=T(s,a) (5)
where s' is the state information of the spacecraft at the next moment, and is determined by the state information s of the spacecraft at the current moment and the action a, and the T function is expanded to the formula (6):
s′[1] = s[1] − 1;  s′[2] = s[2] − C_v;  s′[3] = s[3] − a[2];  s′[4] = a[1];  s′[4 + a[1]] = 1    (6)
wherein C_v is derived from the pre-calculated loss matrix by the parameters (s[4], a[1], T_max − s[3], T_max − s[3] − a[2]).
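A hedged sketch of the state propagation T(s, a) as reconstructed in equation (6); `transfer_cost` is a stand-in for the pre-calculated loss matrix with an assumed lookup signature, and the indices follow the state layout of Table 1 (0-based here).

```python
import numpy as np

def propagate(s, a, transfer_cost, t_max):
    """s' = T(s, a) for action a = (target fragment d, transfer time dt)."""
    d, dt = a
    s_next = np.array(s, dtype=float)
    t_depart = t_max - s[2]                       # elapsed mission time so far
    c_v = transfer_cost(int(s[3]), d, t_depart, t_depart + dt)   # assumed lookup signature
    s_next[0] -= 1                                # one fewer fragment remaining
    s_next[1] -= c_v                              # remaining transfer pulse
    s_next[2] -= dt                               # remaining task time
    s_next[3] = d                                 # new current target fragment
    s_next[4 + d - 1] = 1                         # mark fragment d as cleared
    return s_next

# Toy usage with a constant-cost stand-in for the loss matrix
s0 = np.array([3, 800.0, 300.0, 2, 1, 0, 1, 0, 0])
print(propagate(s0, (4, 20), transfer_cost=lambda *args: 150.0, t_max=365.0))
```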
The benefit function of the embodiment of the application is designed as follows:
the reinforcement learning benefit function design is based on a maximum benefit optimization model. The benefit of each step is r:
r=R(s,a) (7)
where R is the benefit function, expressing the benefit obtained under state s and action a. In the offline optimization of space debris removal path planning, the benefit function is assumed to be deterministic, i.e. E[R(s, a) | a = [d, Δt]^T] = g(d), and the cumulative benefit of all actions is a special case of traditional reinforcement learning:
G(d) = Σ_{i=1}^{n} γ^{i−1} r_i    (8)
wherein the future gain attenuation factor gamma is selected to be 1.
According to the above, the search tree structure of the space debris removal task in the present embodiment is designed to implement the reinforcement learning algorithm. Based on the state, action, and revenue design, different branches of the search tree can be expressed by FIG. 3.
Starting from initial state information at the initial time, the search tree continuously selects action information, obtains benefits according to the selected action information, and reaches a new state according to the state information and the action information to be unfolded to obtain the new state information until reaching a termination state. The termination state is that the currently cleared space debris is the termination space debris, and the termination space debris meets the following conditions: the number of the residual space debris to be cleaned is 0, the residual energy of the spacecraft is 0, or the residual time of the cleaning task is 0.
At this time, the algorithm needs to be reset to the initial node to restart, that is, after the initial state information and the initial action information of the initial moment of the spacecraft are determined, the steps are continuously executed.
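A minimal check of these termination conditions (illustrative names, using the state layout sketched above):

```python
def is_terminal(s, eps=1e-9):
    """Terminal when no debris, no transfer pulse, or no task time remains."""
    remaining, dv_left, t_left = s[0], s[1], s[2]
    return remaining <= 0 or dv_left <= eps or t_left <= eps

print(is_terminal([0, 500.0, 100.0]), is_terminal([3, 500.0, 100.0]))   # True False
```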
The action in the search tree may be selected randomly or through a pre-built neural network model. In this embodiment, to improve the success rate with which the selected actions later form a clearing sequence, a pre-built neural network model is used to select actions; for each selectable action the neural network outputs the probability that selecting it will eventually lead to a successfully formed clearing sequence, and this probability gradually approaches the probability of obtaining the optimal result, because the current objective function is to maximize the total benefit.
The planned action selection probability distribution is hidden in the pre-calculated loss function, i.e. the transfer energy consumption matrix that is accessed continuously during optimization, because how much future benefit the current removal sequence (i.e. the clearing sequence) can obtain is not known until the termination state is reached. The search tree therefore needs to be built up step by step over repeated offline optimization trials. As more data is generated, the search tree gradually approaches the optimal action selection probability distribution.
In practice, however, search trees tend to have very large depth and breadth and cannot be searched exhaustively. To speed up the search, methods such as Monte Carlo Tree Search (MCTS) balance exploration and exploitation, and neural networks can also be used to evaluate current state values. An excellent search algorithm should be able to estimate the probability that an action will eventually reach the optimal solution, so that the optimal solution can be explored without excessive searching, while also using limited experience to avoid falling into local optima. Therefore, the planning method of the space debris removal task in this embodiment combines the neural network evaluation method with the tree search process, so that the optimal solution, i.e. the optimal space debris clearing sequence, can be obtained quickly and reliably.
In the planning method for a space debris removal task of this embodiment, as shown in fig. 1, the space debris information to be cleared and the spacecraft state information are first acquired. The space debris information includes the debris orbit parameters, mass, RCS, etc., from which the transfer pulse consumption and the clearing benefit parameters can be calculated. The spacecraft state information includes the current task of the spacecraft, the remaining transfer pulse, the remaining task time, and so on.
After space debris information to be cleared and spacecraft state information are acquired, constructing a reinforcement learning search tree model according to the space debris information to be cleared and the spacecraft state information; wherein state quantity, action and benefit values are included in the reinforcement learning search tree model.
The state quantity includes the number of remaining space debris to be cleared, the remaining energy of the spacecraft, the remaining time of the current space debris clearing task, the next space debris number to be cleared, and a binary representation of all space debris states. The actions are actions performed by a spacecraft going from one space debris to another. The benefit value is a scoring value obtained after taking an action with respect to a state quantity.
Before the space debris clearing sequence is generated, each state in the generation process is defined. In this embodiment, since states and actions are defined under a step framework rather than a time framework, each element of the decision sequence (s_1, a_1, r_1; s_2, a_2, r_2; s_3, a_3, r_3; …) corresponds to one space debris clearing step. The state value function is then defined as:
V_π(s) = E_π[ Σ_i γ_i r_i | s ]
wherein π(·|s) is the policy for selecting actions in state s, and γ_i is the attenuation factor applied to the future benefit of the i-th space debris, which is taken as 1 in the planning problem of this embodiment. The state value function expresses how much total benefit can be expected in the future by following the policy π from the current state s.
Because the state transfer function is deterministic, the optimal state action pair value function is
Q*(s, a) = R(s, a) + γ max_{a′} Q*(s′, a′)
If the optimal Q value is converged to, then π(s) at that time is the optimal strategy; the state-action value function is:
Q_π(s, a) = E_π{ R(s, a) + γ V_π(s′) }    (10)
based on the above definition, the steps of generating the sequence in this embodiment specifically include:
generating actions by adopting the upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; and generating a next state quantity according to the state quantity and the action, and generating a space debris clearing sequence and a corresponding benefit value when the generated state quantity is a termination state quantity.
In the expansion process of the UCT search method, actions are selected randomly or through a pre-built neural network model; in the simulation process of the UCT search method, random simulation is selected as the simulation mode.
And repeatedly executing the sequence-generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the optimal clearing sequence of the space debris to be cleared. The first predetermined number is a preset value indicating how many candidate clearing sequences are generated.
And if the number of the clearing sequences does not reach the first preset number, repeating the step of generating the sequences until the number of the clearing sequences reaches the first preset number, and considering that the currently generated clearing sequences comprise the optimal clearing sequences.
In the generating sequence step, when the generated state quantity is a non-termination state quantity, the following steps are repeatedly performed:
generating actions by adopting a first strategy in the UCT search method according to the current state quantity;
generating a next state quantity according to the current state quantity and the generated action;
and judging whether the next state quantity is a termination state quantity or not.
According to the space debris clearing sequence generation method, the reinforcement learning search tree model is built from the space debris information to be cleared and the spacecraft information, and the space debris clearing problem is converted into state quantities, actions and benefit values; actions are generated through the upper confidence bound tree (UCT) search method, specifically, actions are selected randomly or through a pre-built neural network model in the expansion process of the method, and random simulation is selected as the simulation mode of the simulation process. This reduces the number of searches needed to generate the optimal clearing sequence, saves generation time of the optimal clearing sequence, further improves the energy utilization of the spacecraft, and increases the rationality of space debris clearing task planning.
An improved upper confidence tree search (Modified Upper Confidence Tree search, UCT) algorithm is specifically employed in this embodiment to generate the space debris clearing sequence. The UCT algorithm used here combines MCTS with the upper confidence bound (UCB) method, balancing search and exploitation, and such algorithms have achieved historically strong results.
The basic framework of the UCT algorithm is selection, expansion, evaluation and update. The UCT algorithm of AlphaGo Zero does not use a random simulation (roll-out) step, because it aims to evaluate the outcome of a game through a neural network without exploring to the end of the game. In the ADR problem of this embodiment, however, before the neural network converges and acquires a usable evaluation ability, the roll-out step can simulate state values very well and provides more reliable feedback for each update; this is the core of the improved UCT algorithm.
In this embodiment, the benefit value of each clearing sequence is the sum of the benefit values of the space debris it contains. Initially, the determined initial state information and initial action information of the spacecraft are used as the root node of the UCT search, and the corresponding subtrees (i.e. the space debris to be cleared) are selected for access according to the likelihood of obtaining a higher benefit by expanding downward. The UCT search outputs the action selection distribution π, and the search tree is also extended continuously in this step.
To make the UCT search feasible, the nodes of the search tree need to maintain node information during creation and updating. In addition to the current node's state s', action a and pre-action state s, there are the node visit count N(s, a), the node's historically accumulated benefit W(s, a), the node Q value, the node selection probability P(s, a), and so on. When a new node is created, all node information is initialized as Q(s, a) = 0, N(s, a) = 0, W(s, a) = 0, P(s, a) = p_i, where p_i is an initial constant. Once a node is revisited, the update policy of equation (11) is followed.
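A minimal sketch of this node bookkeeping; the class and field names are assumptions, and P is initialised to a placeholder value for the initial constant p_i.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One search-tree node: state s' reached by taking action a from the pre-action state s."""
    state: tuple
    action: Optional[tuple] = None
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)   # action -> child Node
    N: int = 0        # visit count N(s, a)
    W: float = 0.0    # accumulated benefit W(s, a)
    Q: float = 0.0    # mean benefit Q(s, a) = W(s, a) / N(s, a)
    P: float = 0.1    # selection probability P(s, a), initialised to the constant p_i

root = Node(state=(3, 800.0, 300.0, 2, 1, 0, 1, 0, 0))
print(root.N, root.Q, root.P)   # 0 0.0 0.1
```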
Fig. 2 shows the flowchart of the UCT-ADR method, the improved UCT algorithm specialized for the ADR task in this embodiment. The node information of the entire search tree is stored in RAM, as long as the machine memory can support the search tree; a selection policy is generated from all child-node information of the root node, and the next node, representing a legal action, is selected according to the policy probability and a simulated annealing method. The UCT search has four steps: selection, expansion, random simulation and reverse update. This loop continues for a certain number of iterations, expanding the search tree and updating node information by expanding new nodes. The search process can be regarded as an agent traversing the search tree.
In this embodiment, generating the next state quantity from the current state quantity and the generated action includes:
acquiring a current state quantity;
when the current state quantity is a non-termination state quantity and meets a first preset condition, selecting actions from a pre-constructed action library according to a first strategy;
and generating a next state quantity according to the selected action and the current state quantity.
The above steps are the selection steps in the UCT search.
Selecting:
first, a current state quantity is acquired. In this embodiment, the current state quantity refers to state information of the spacecraft at the current moment, where the state information includes position information of the current spacecraft, and the position information is the position information where the cleared space debris is located.
After the position information of the current spacecraft is obtained, the information of the current cleared space debris can be obtained, condition judgment is carried out according to the information of the space debris, the judging condition comprises two parts, wherein the first part is whether the cleared space debris is a termination space debris, and the other part is a first preset condition, and the first preset condition can be a certain tree width condition.
When meeting: when the current state quantity is a non-termination state quantity and meets a first preset condition, selecting actions from a pre-constructed action library according to a first strategy; the selected action information is passed through
Figure BDA0002282436510000141
Selecting, wherein->
Figure BDA0002282436510000142
For UCB term, the term can restrain the exploring degree outside the current search tree, c puct Parameters of the exploration ratio are adjusted.
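A sketch of this selection rule, applied to node objects with the fields N, Q and P introduced in the earlier node sketch (field names are assumptions):

```python
import math
from types import SimpleNamespace

def select_child(node, c_puct=1.5):
    """Pick the child action maximising Q(s, a) + U(s, a), with U the UCB exploration term."""
    n_total = sum(child.N for child in node.children.values())

    def score(child):
        u = c_puct * child.P * math.sqrt(n_total) / (1 + child.N)
        return child.Q + u

    return max(node.children.values(), key=score)

# Toy usage: the rarely visited child wins through the exploration term
kids = {1: SimpleNamespace(N=10, Q=4.0, P=0.5), 2: SimpleNamespace(N=1, Q=3.8, P=0.5)}
print(select_child(SimpleNamespace(children=kids)).N)   # 1
```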
In the above process, the state quantity corresponding to the termination space fragment is the termination state quantity, and the state quantity corresponding to the non-termination space fragment is the non-termination state quantity.
If the space debris cleared by the spacecraft is a non-termination space debris, the following is repeated: the next state quantity is generated from the previous state quantity and the action, the information of the space debris currently cleared by the spacecraft is extracted from the state quantity, and this information is judged, until the currently cleared space debris is a termination space debris.
Expansion:
in the selecting step of the present embodiment, when the current state quantity is a non-termination state quantity and the first preset condition is not satisfied, an expanding step of the UCT search is entered.
In this step, the current state quantity is used as input information, a second strategy is generated by the expansion and simulation processes of the UCT search method, and actions are then selected from the pre-constructed action library according to the second strategy.
Generating the second strategy by the expansion and simulation processes of the UCT search method comprises the following steps (a rough sketch follows the list):
selecting actions from a pre-constructed action library, randomly or through the constructed neural network model, according to the current state quantity;
generating a next state quantity according to the action and the current state quantity;
randomly selecting actions from a pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of space fragments to be cleared;
updating a benefit value for each spatial fragment in the purge sequence;
when the number of purge sequences reaches a second predetermined number, a second policy is generated based on the benefit value for each spatial fragment.
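The steps above can be read as the following rough sketch; `legal_actions`, `step`, `is_terminal` and `policy_value_net` are assumed callables standing in for the action library, the state propagation, the termination test and the neural network, and the roll-out chooses actions uniformly at random.

```python
import random

def expand_and_rollout(state, legal_actions, step, is_terminal, policy_value_net=None):
    """Expand one step (randomly or via the network), then roll out randomly to termination.

    Returns the generated clearing sequence and its accumulated benefit.
    """
    sequence, total_benefit = [], 0.0
    actions = legal_actions(state)
    if policy_value_net is not None:               # expansion guided by the network
        probs, _value = policy_value_net(state)    # probs assumed to be {action: probability}
        action = max(actions, key=lambda a: probs.get(a, 0.0))
    else:                                          # expansion by random choice
        action = random.choice(actions)
    while True:                                    # random simulation (roll-out)
        state, reward = step(state, action)
        sequence.append(action)
        total_benefit += reward
        if is_terminal(state) or not legal_actions(state):
            return sequence, total_benefit
        action = random.choice(legal_actions(state))
```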
In the UCT-ADR algorithm, the neural network of equation (12) takes the ADR task state vector s as input and outputs the estimated state value v and the action selection distribution p in that state, where θ is the neural network parameter. P(s, a) = p(a) assists the node information update.
f_θ(s) = (p, v)    (12)
The neural network architecture differs from problem to problem. Given that residual networks (ResNet) are well suited to avoiding overfitting, the neural network structure used in the embodiment of the present application is shown in FIG. 4, and techniques such as learning-rate decay, batch processing and memory replay are used.
Memory consists of historical access data < s, z, pi > stored in memory, which is generated by the outer loop cycle. Each outer loop cycle will produce a corresponding number of memory data corresponding to the length of the output solution, and all state vectors s along the solution point to the same final benefit z. And the batch learning mechanism can randomly select a specific amount of memory from the memory bank for learning.
The loss function for training the neural network is equation (13). It expresses the difference between the state value v predicted by the neural network (i.e. the predicted final benefit) and the final experimental benefit z, as well as the difference between the estimated action distribution p and the action selection distribution π output by UCT-ADR. L2-norm regularization of the weights is also included.
l = (z − v)² − π^T log p + c‖θ‖²    (13)
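Assuming p and π are discrete distributions over the same action set, equation (13) can be written out directly as follows (toy numbers, with a small constant added inside the logarithm for numerical stability):

```python
import numpy as np

def uct_adr_loss(z, v, pi, p, theta, c=1e-4, eps=1e-12):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2   (equation (13))."""
    value_term = (z - v) ** 2
    policy_term = -np.dot(pi, np.log(p + eps))     # cross-entropy between pi and p
    reg_term = c * np.sum(theta ** 2)              # L2 regularisation of the weights
    return value_term + policy_term + reg_term

pi = np.array([0.7, 0.2, 0.1])       # action distribution output by the UCT search
p = np.array([0.6, 0.3, 0.1])        # action distribution estimated by the network
theta = np.array([0.5, -0.2])        # stand-in for the network weights
print(round(uct_adr_loss(z=5.0, v=4.2, pi=pi, p=p, theta=theta), 4))
```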
Random simulation:
after the expansion, the space debris beyond the newly expanded node has not yet been accessed, so random simulation is needed to evaluate the situation of the remaining space debris to be cleared and to provide a selection policy for finally choosing a suitable action. Since this process is only a simulation, this step does not perform any update operation on the clearing sequence. The simulation in this embodiment uses a random principle: actions are selected continuously until a termination node is reached, and the benefit at that point is taken as the possible future benefit corresponding to the current solution vector. That is, among the space debris not yet cleared, one is selected at random as the next debris to clear and the above steps are repeated until the cleared space debris is a termination space debris; a clearing sequence of the space debris to be cleared is then generated in the selected order, completing the random simulation process.
By means of the random simulation process, time for generating the clearing sequence can be reduced, calculation efficiency is improved, and performance requirements on hardware (such as a memory and a processor) can be reduced.
Reverse update:
when a space debris clearing sequence has been generated in the simulation process, the reverse update of this embodiment updates the benefit value of each space debris involved in the clearing sequence, so as to provide a better strategy for subsequent action selection; the information (i.e. benefit value, etc.) of each space debris in the clearing sequence of the space debris to be cleared is updated, the benefit value update adopting the following formulas:
N(s,a)=N(s,a)+1
W(s,a)=W(s,a)+z (11)
Q(s,a)=W(s,a)/N(s,a)
P(s,a)=(1-ε)P(s,a)+εη
wherein z is the state value generated by the roll-out, ε is a weight, η is noise used to increase the robustness of the neural network, and P(s, a) is part of the output of the neural network, expressing the probability of selecting action a in state s; it is input to the UCT search for reference.
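A direct transcription of these update rules, applied along the visited path to node objects with the fields of the earlier node sketch; the uniform random noise used for η here is only an illustrative choice.

```python
import random

def backup(path, z, eps=0.25):
    """Reverse update of equation (11) along the visited path with the roll-out value z."""
    for node in reversed(path):
        node.N += 1                                # N(s,a) = N(s,a) + 1
        node.W += z                                # W(s,a) = W(s,a) + z
        node.Q = node.W / node.N                   # Q(s,a) = W(s,a) / N(s,a)
        eta = random.random()                      # noise term (illustrative)
        node.P = (1 - eps) * node.P + eps * eta    # P(s,a) = (1-eps)P(s,a) + eps*eta
```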
After updating the benefit value of each space debris, when the number of the cleaning sequences for cleaning the space debris reaches a second preset number, generating a next action information selection strategy, and selecting the next action according to the action information selection strategy.
The generation of the next action selection policy is specifically:
π(a|s_0) = N(s_0, a)^(1/τ) / Σ_b N(s_0, b)^(1/τ)
wherein s_0 is the spacecraft state information input to the UCT search at the previous moment, τ is the simulated annealing parameter controlled by the outer loop, and b, the second index of N(s_0, ·), ranges over the possible actions, so the denominator expresses the sum over all possible actions. Expansion of the search tree is more exploratory at the beginning; as the search proceeds, the simulated annealing temperature decreases and the strategy output by the UCT search concentrates on the better nodes, making the search more biased towards exploitation. The final action node is selected according to the probabilities provided by this policy.
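A small sketch of turning the root's visit counts into this annealed selection policy; the counts are toy numbers.

```python
import numpy as np

def action_policy(visit_counts, tau=1.0):
    """pi(a|s0) proportional to N(s0, a)^(1/tau); smaller tau concentrates on the most visited action."""
    counts = np.asarray(visit_counts, dtype=float)
    weights = counts ** (1.0 / tau)
    return weights / weights.sum()

counts = [120, 30, 10]
print(action_policy(counts, tau=1.0))    # [0.75   0.1875 0.0625]
print(action_policy(counts, tau=0.25))   # sharply peaked on the most visited action
```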
When the number of clear sequences does not reach the second predetermined number, the following steps are repeatedly performed:
selecting actions from a pre-constructed action library, randomly or through the constructed neural network model, according to the current state quantity; generating the next state quantity from the action and the current state quantity; randomly selecting actions from the pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of the space debris to be cleared; and updating the benefit value of each space debris in the clearing sequence.
The method combines the reinforcement learning model with the Monte Carlo tree search (MCTS) method and provides an improved upper confidence bound tree search (UCT-ADR) method adapted to the planning of multi-debris active clearing tasks; it can effectively reduce the computation of the planning task, does not fall into local optima, and has excellent performance.
In another embodiment of the present application, a planning apparatus for a space debris removal task is disclosed, including:
the acquisition module is used for acquiring space debris information to be cleaned and spacecraft state information;
the construction module is used for constructing a reinforcement learning search tree model according to the space debris information to be cleaned and the spacecraft state information; wherein the state quantity, action and benefit values are included in the reinforcement learning search tree model;
a generating module, configured to generate a sequence: generating actions by adopting an upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; generating a next state quantity according to the state quantity and the action, and generating a space debris clearing sequence and a corresponding benefit value when the generated state quantity is a termination state quantity;
in the expansion process of the UCT search method, selecting actions randomly or through a pre-built neural network model; in the simulation process of the UCT search method, selecting random simulation as the simulation mode;
and the selection module is used for repeatedly executing the sequence-generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the clearing sequence of the space debris to be cleared.
In the generating sequence step, when the generated state quantity is a non-termination state quantity, the generating sequence step further includes the following modules for repeatedly executing the following steps:
the first action generating module is used for generating actions by adopting a first strategy in the upper confidence bound tree (UCT) search method according to the current state quantity;
the first state quantity generation module is used for generating a next state quantity according to the current state quantity and the generated action;
and the judging module is used for judging whether the next state quantity is a termination state quantity.
In another embodiment of the present application, a planning apparatus for a space debris removal task is also disclosed, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a method for planning a space debris removal task as described above when executing the computer program.
Verification example:
in this embodiment, the dataset consists of 10 fragments from the Iridium 33 fragment cloud. The task objective is set to the number of fragments cleared in one mission, thereby defining G(d). The task objective is evaluated at the termination node as the benefit function and used for node information updating. The neural network parameters are optimized by stochastic gradient descent (SGD). Experiments in multiple scenarios verify the validity of the framework and the algorithm. The final results are compared with the common reinforcement learning algorithm DQN.
Experimental data set:
the experimental data set consists of 10 fragments taken from the Iridium 33 fragment cloud, as shown in Table 2. The orbit data are used only to calculate the transfer fuel consumption, which is computed via drift orbit transfer under J_2 perturbation.
Table 2 experimental data set
Assuming that the fragment orbits in the set are all circular, NORAD denotes the space object number maintained by the North American Aerospace Defense Command, a is the orbit semi-major axis, I is the orbit inclination, and Ω is the right ascension of the ascending node. The data set is taken from the TLE file of the United States air defense department, the TLE epoch being 17126 (UTC, i.e. May 2017).
The actions in this experiment are indexed in the action space by <target fragment, transfer interval> pairs. The action space is formed by first enumerating the transfer intervals and then the fragment sequence, as shown in Table 3.
TABLE 3 action space
Task parameters:
as a task planning method, the task boundaries are determined as follows: the space multi-debris clearing spacecraft has a 1000 m/s orbit transfer capability, and the mission starts on day 1 from fragment #1 and lasts 365 days or until the orbit transfer energy is exhausted. All results were obtained on an Intel i5-7200U notebook computer. Because the data set is only a subset of the full Iridium 33 fragment cloud, the computation is affordable, so an exhaustive search is also carried out for comparison, to obtain the global optimal solution and an overview of the search tree.
a) Exhaustive search:
using depth-first search as the exhaustive search algorithm, the search ended with 124024 termination nodes after 601 seconds. The optimal solution, expressed as an action sequence, is [1,30,40,20,60]; the corresponding solution vector is [1,1; 3,28; 4,28; 2,28; 6,28]^T. The length statistics of all visited solutions are [/,31,320,1052,762,0,0,0,0,0], where each entry is the count of solutions of the corresponding length; none of the solutions was visited repeatedly, which roughly reflects the structure of the search tree.
b) Experiment 1: UCT search without neural network evaluation:
to demonstrate the performance of the UCT search on the ADR problem, Experiment 1 tested the UCT-ADR for 2 outer loop cycles, with the inner-loop UCT search iteration limit set to T = 300. The created search tree contains 982 nodes in total, 104 of which are non-terminating nodes. The optimal solution, expressed as an action sequence, is [1,30,39,19,59], and the solution vector is [1,1; 3,22; 4,19; 2,19; 6,19]. The Q value is obtained as W/N, i.e. the average benefit of the node obtained through repeated searching. At the root node of the entire search tree the Q value is 5690/1500 = 3.7933. The Q values of all nodes on the solution vector are shown in Table 4.
TABLE 4 optimal solution for the Q value of each node (experiment 1 after 2 outer loops)
Action W N W/N
1 5690 1500 3.7933
30 5170 1232(82.13%) 4.196
39 4346 949(77.03%) 4.580
19 3177 645(67.97%) 4.926
59 2245 449(69.61%) 5
Meanwhile, the case of each layer of sibling nodes where the node on the solution vector is located is shown in table 5.
TABLE 5 policy distribution at the level of nodes for optimal solution (experiment 1 after 2 outer loops)
Layer\first 6 items 1 2 3 4 5 6
First layer 30(82.13%) 27(1.93%) 18(1.93%) 28(1.93%) 29(1.40%) 25(1.33%)
Second layer 39(77.03%) 20(3.97%) 15(2.35%) 18(2.03%) 38(1.70%) 36(0.97%)
Third layer 19(67.97%) 60(3.06%) 59(3.06%) 58(3.06%) 17(3.06%) 56(2.95%)
Fourth layer 59(69.61) 60(22.95) / / / /
After continuing the search for 1000 outer loops, the benefit of the stored optimal solution [1,30,18,37,57] is still 5, while the average benefit at the search tree root node is 7498117/1500000 = 4.9987. The root-node average benefit being close to 5 indicates that the entire search tree has gradually converged to the optimal solution in the later stage; the converged policy distribution is shown in Table 6.
TABLE 6 probability distribution at the layer where each node is located for the optimal solution (experiment 1 after 1000 outer loops)
Layer\first 6 items 1 2 3 4 5 6
First layer 30(99.98253%) 39(0.00187%) 26(0.00187%) / / /
Second layer 18(99.98006%) 38(0.00240%) 20(0.00193%) / / /
Third layer 37(99.98059%) 40(0.00280%) 38(0.00240%) 35(0.00193%) / /
Fourth layer 57(25.01274%) 60(24.99273%) 59(24.99273%) 58(24.99273%) / /
1067 nodes were visited, 125 of which are non-terminating nodes, indicating that more nodes were accessed. The first node with the optimal benefit of 5 was found after 140 seconds. These results show that the UCT search is feasible in the algorithm and can find the optimal solution. However, in some runs there are still problems of local optimality and loss of exploration.
c) Experiment 2: UCT-ADR algorithm:
when the complete UCT-ADR algorithm is used, the neural network is trained with the continuously growing search entries; the learning rate α is decayed from 0.001 to 0.00001, the memory size is 2000, and the batch size is 10 in the test. The neural network structure adopts a residual network, which best avoids overfitting, with two ResNets fitting the two outputs respectively. The state value network consists of a fully connected network of 20 nodes and a residual network of 20 nodes, and its output is processed by a softmax function. The policy network is similar to the state value network, except that the output layer is connected to another fully connected network whose output passes through a ReLU function. The design of the two networks is shown in Fig. 4.
After 1000 outer loop iterations, the trained neural network loss is shown in FIG. 5. After changing the batch learning size to 100 and completing 3500 outer loop iterations, a better convergence effect is obtained, as shown in FIG. 6; the network converges so that the optimal solution is output for 303 consecutive cycles.
After 3500 outer loop iterations, the optimal solution is [1,28,17,36,60], and the length statistics of all visited solutions are [61,87,102,132,3118,0,0,0,0,0]; in the 3500 loops, 3118 optimal solutions were searched out. The final search tree is analyzed further in Tables 7 and 8. It can be seen that after convergence each node on the optimal solution has a dominant selection probability, where "subtree width" represents the exploration degree of the node. The first three layers all converge to the optimal solution, but three nodes at the fourth layer have similar selection probabilities because there is no suitable non-terminating node after them. The selection probability reflected by the visit count N reflects the possibility of obtaining a higher benefit. From these results, convergence is achieved both in the neural network estimation and in the generation of the optimal solution. Without the UCT search, the neural network would be no different from learning the whole loss function; without the neural network update and the random simulation to guide the node information, the UCT search easily falls into local optima and cannot find the optimal solution robustly.
Table 7 optimal solution each node Q value (experiment 2 after 3500 outer cycles)
Table 8 probability distribution of the layer where each node is located for the optimal solution (experiment 2 after 3500 outer loops)
Subtree width Layer\first 4 items 1 2 3 4
86/90 First layer 27(99.9950%) 38(3.6190e-06) 19(3.6190e-06) 15(3.0476e-06)
76/80 Second layer 16(99.9945%) 17(6.6670e-06) 37(5.1431e-06) 36(4.0002e-06)
66/70 Third layer 35(99.994%) 39(8.0008e-06) 37(6.0959e-06) 33(4.5719e-06)
56/60 Fourth layer 59(40.006%) 58(30.002%) 57(29.989%) /
46/50 Fifth layer 69(2.1770%) 79(2.1765%) 68(2.1763%) Similar probabilities
Note that: the fifth layer is the termination node.
d) Comparison:
the method of the embodiment of the present application finds the optimal solution more quickly than the 601 seconds of the exhaustive method: over 100 UCT-ADR runs (with the outer loop iteration limit set to M = 500 for faster statistics), the optimal solution was found after an average of 138.05 cycles and 164.44 seconds, with only 3 failures. By contrast, this shows that the method is efficient as an ADR planner within the reinforcement learning framework. When deep reinforcement learning (DQN) is used to solve the problem, the planner continuously generates optimal-length solutions only after about 100 iterations of DQN learning, taking 1954 seconds. The optimal solution accounts for about 1/50 of all solutions output by DQN, while this ratio is 31/35 in the present method, which is sufficient to demonstrate the high efficiency of the method of the embodiment of the present application for ADR under the reinforcement learning framework modeling.
The original UCT algorithm used by AlphaGo Zero was also examined as a comparison on the present problem. It did not find the optimal solution after 3500 iterations (8032 seconds). The solution-length statistics [3500,3340,2490,534,0,0,0,0,0,0] show that it can neither find the optimal solution nor output it continuously. In 100 independent experimental attempts at M = 500, the original UCT algorithm could not find any optimal solution. The reason is that the untrained neural network cannot evaluate the state value function at the beginning, so the search direction is biased, and in the later stage the search can hardly be pulled back from subtrees in which the optimal solution cannot be found, so the optimal solution cannot be obtained efficiently.
Therefore, once the reinforcement learning framework for multi-debris active removal task planning is established, the UCT-ADR algorithm can be combined with random simulation so that exploration and exploitation are balanced and better results are output more efficiently.
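As a minimal sketch of this balance between exploration and exploitation (assumed helper names, not the code of the embodiment), random simulation completes a clearing sequence from a leaf state, while the root action is drawn from the visit-count policy tempered by the simulated-annealing parameter τ:

import random

def random_rollout(state, feasible_actions, step):
    # Randomly extend `state` until a termination state quantity is reached;
    # return the accumulated benefit of the completed clearing sequence.
    benefit = 0.0
    while True:
        actions = feasible_actions(state)
        if not actions:                 # termination state quantity
            return benefit
        a = random.choice(actions)      # random simulation: uniform action choice
        state, reward = step(state, a)
        benefit += reward

def root_policy(visit_counts, tau):
    # pi(a|s0) proportional to N(s0, a)^(1/tau); tau is annealed by the outer loop.
    # Assumes at least one action at the root has been visited.
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: v / total for a, v in powered.items()}

Early on, the rollouts supply benefit estimates for unexplored branches, while a decreasing τ gradually concentrates the root policy on the branch with the most visits.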
The method introduces a reinforcement learning framework for guiding and planning the multi-debris active removal task and can be used for offline task optimization; because the way the task planning model continuously accumulates the maximum benefit is consistent with the reinforcement learning framework, the dimension of the solution vector does not need to be fixed.
Compared with other existing algorithms, the method provided by the application operates more efficiently on the ADR task, and the results show that it performs well in finding the optimal solution, in the proportion of optimal solutions among its outputs, and in avoiding local optima during the search.
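For readability, the state quantity, action and benefit value recited in the claims below can be pictured as a simple data structure. The following Python sketch is only an illustration; the field names and types are assumptions rather than the notation of the embodiment.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class StateQuantity:
    num_debris_to_clear: int          # number of space debris still to be cleared
    remaining_energy: float           # remaining energy of the spacecraft
    remaining_time: float             # remaining time of the current removal task
    next_debris_id: int               # number of the next space debris to be cleared
    debris_states: Tuple[int, ...]    # binary expression of all space debris states (1 = cleared)

# An action is the transfer of the spacecraft from one space debris to another,
# represented here simply by the identifier of the target debris.
Action = int

def benefit(state: StateQuantity, action: Action) -> float:
    # The benefit value is a score obtained after taking `action` in `state`;
    # the concrete scoring function of the embodiment is not reproduced here.
    raise NotImplementedError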

Claims (9)

1. A method for planning a space debris removal task, comprising:
acquiring space debris information to be cleaned and spacecraft state information;
constructing a reinforcement learning search tree model according to the space debris information to be cleaned and the spacecraft state information; wherein state quantity, action and benefit values are included in the reinforcement learning search tree model;
Generating a sequence: generating actions by adopting an upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; generating a next state quantity according to the state quantity and the action, and generating a space debris removal sequence and a corresponding benefit value when the generated state quantity is a termination state quantity; the UCT search method consists of a Monte Carlo tree search method and a UCB term; wherein the UCB term is
U(s, a) = c_puct · P(s, a) · √( Σ_b N(s, b) ) / ( 1 + N(s, a) )
where U(s, a) is the UCB term, s denotes the state, a denotes the action, c_puct is the parameter adjusting the exploration proportion, P(s, a) denotes the node selection probability, N(s, b) denotes the number of node visits when the state is s and the action is b, and N(s, a) denotes the number of node visits when the state is s and the action is a;
in the expansion process of the UCT search method, actions are selected through a random or constructed neural network model; in the simulation process of the UCT search method, random simulation is selected as the simulation mode;
repeatedly executing the sequence generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the optimal clearing sequence of the space debris to be cleared;
the action generated by the UCT search method is given by
π(a|s_0) = N(s_0, a)^(1/τ) / Σ_b N(s_0, b)^(1/τ)
where π(a|s_0) denotes the action selection policy, s_0 is the spacecraft state information input to the most recent UCT search, N(s_0, a) denotes the number of node visits for state s_0 and action a, N(s_0, b) denotes the number of node visits for state s_0 and action b, τ is a simulated-annealing parameter controlled by the outer loop, and b ranges over the possible actions;
the state quantity comprises the number of the space debris to be cleared, the remaining energy of the spacecraft, the remaining time of the current space debris clearing task, the number of the next space debris to be cleared and the binary expression of all the space debris states;
the action is a transfer of the spacecraft from one space debris to another space debris;
the benefit value is a scoring value obtained after taking an action for a state quantity.
2. A method of planning a space debris removal task according to claim 1, wherein in the generating sequence step, when the generated state quantity is a non-termination state quantity, the following steps are repeatedly performed:
generating actions by adopting a first strategy in the UCT search method according to the current state quantity;
generating a next state quantity according to the current state quantity and the generated action;
and judging whether the next state quantity is a termination state quantity or not.
3. A method of planning a space debris removal task according to claim 2, wherein generating a next state quantity based on the current state quantity and the generated action comprises:
acquiring a current state quantity;
when the current state quantity is a non-termination state quantity and a first preset condition is met, selecting actions from a pre-constructed action library according to the first strategy;
and generating a next state quantity according to the selected action and the current state quantity.
4. A method of planning a space debris removal task according to claim 3, wherein when the current state quantity is a non-termination state quantity and the first preset condition is not satisfied:
generating a second strategy by using the current state quantity as input information and adopting the expansion and simulation process in the UCT search method;
and selecting actions from a pre-constructed action library according to the second strategy.
5. The method for planning a space debris removal task of claim 4, wherein the method for generating the second strategy using the expansion and simulation process in the UCT search method comprises:
selecting actions from a pre-constructed action library through a random or constructed neural network model according to the current state quantity;
Generating a next state quantity according to the action and the current state quantity;
randomly selecting actions from a pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of the space debris to be cleared;
updating the benefit value of each space debris in the clearing sequence;
and generating a second strategy based on the benefit value of each space debris when the number of clearing sequences reaches a second preset number.
6. A method of planning a space debris removal task according to claim 5, wherein when the number of clearing sequences does not reach the second preset number, the following steps are repeatedly performed:
selecting actions from a pre-constructed action library through a random or constructed neural network model according to the current state quantity;
generating a next state quantity according to the action and the current state quantity;
randomly selecting actions from a pre-constructed action library according to the state quantity until the state quantity is a termination state quantity, and generating a clearing sequence of the space debris to be cleared;
and updating the benefit value of each space debris in the clearing sequence.
7. A planning apparatus for a space debris removal task, comprising:
the acquisition module is used for acquiring space debris information to be cleaned and spacecraft state information;
The construction module is used for constructing a reinforcement learning search tree model according to the space debris information to be cleaned and the spacecraft state information; wherein state quantity, action and benefit values are included in the reinforcement learning search tree model;
a generating module, configured to generate a sequence: generating actions by adopting an upper confidence bound tree (UCT) search method according to the initial state of the reinforcement learning search tree model; generating a next state quantity according to the state quantity and the action, and generating a space debris removal sequence and a corresponding benefit value when the generated state quantity is a termination state quantity; the UCT search method consists of a Monte Carlo tree search method and a UCB term; wherein the UCB term is
U(s, a) = c_puct · P(s, a) · √( Σ_b N(s, b) ) / ( 1 + N(s, a) )
where U(s, a) is the UCB term, s denotes the state, a denotes the action, c_puct is the parameter adjusting the exploration proportion, P(s, a) denotes the node selection probability, N(s, b) denotes the number of node visits when the state is s and the action is b, and N(s, a) denotes the number of node visits when the state is s and the action is a;
in the expansion process of the UCT search method, actions are selected through a random or constructed neural network model; in the simulation process of the UCT search method, random simulation is selected as the simulation mode;
The selection module is used for repeatedly executing the sequence generating step until the number of clearing sequences reaches a first preset number, and selecting the clearing sequence with the largest benefit value as the clearing sequence of the space debris to be cleared;
the action generated by the UCT search method is given by
π(a|s_0) = N(s_0, a)^(1/τ) / Σ_b N(s_0, b)^(1/τ)
where π(a|s_0) denotes the action selection policy, s_0 is the spacecraft state information input to the most recent UCT search, N(s_0, a) denotes the number of node visits for state s_0 and action a, N(s_0, b) denotes the number of node visits for state s_0 and action b, τ is a simulated-annealing parameter controlled by the outer loop, and b ranges over the possible actions;
The state quantity comprises the number of the space debris to be cleared, the remaining energy of the spacecraft, the remaining time of the current space debris clearing task, the number of the next space debris to be cleared and the binary expression of all the space debris states;
the action is a transfer of the spacecraft from one space debris to another space debris;
the benefit value is a scoring value obtained after taking an action for a state quantity.
8. The space debris removal task planning apparatus of claim 7, further comprising, when the state quantity generated in the sequence generating step is a non-termination state quantity, modules for repeatedly performing the following steps:
The first action generating module is used for generating actions by adopting a first strategy in the UCT search method according to the current state quantity;
the first state quantity generation module is used for generating a next state quantity according to the current state quantity and the generated action;
and the judging module is used for judging whether the next state quantity is a termination state quantity or not.
9. A planning device for a space debris removal task, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements a method for planning a space debris removal task according to any one of claims 1 to 6 when executing the computer program.
CN201911146850.1A 2019-11-21 2019-11-21 Planning method and device for space debris removal task Active CN110991712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911146850.1A CN110991712B (en) 2019-11-21 2019-11-21 Planning method and device for space debris removal task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911146850.1A CN110991712B (en) 2019-11-21 2019-11-21 Planning method and device for space debris removal task

Publications (2)

Publication Number Publication Date
CN110991712A CN110991712A (en) 2020-04-10
CN110991712B true CN110991712B (en) 2023-04-25

Family

ID=70085530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911146850.1A Active CN110991712B (en) 2019-11-21 2019-11-21 Planning method and device for space debris removal task

Country Status (1)

Country Link
CN (1) CN110991712B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642785B (en) * 2021-07-28 2023-10-20 中国测绘科学研究院 Method, system and equipment for long-term prediction of space debris track based on priori information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6705977B2 (en) * 2017-01-31 2020-06-03 株式会社安川電機 Robot path generation device and robot system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9434485B1 (en) * 2013-01-25 2016-09-06 Stephen C. Lehocki Multi-purpose cargo delivery and space debris removal system
CN106840164A (en) * 2017-01-09 2017-06-13 西北工业大学 A kind of many fragments actively remove online weight planning algorithm
CN107341578A (en) * 2017-07-25 2017-11-10 哈尔滨工业大学 Space junk based on genetic algorithm actively removes mission planning method
CN110081893A (en) * 2019-04-01 2019-08-02 东莞理工学院 A kind of navigation path planning method reused based on strategy with intensified learning
CN110395412A (en) * 2019-07-05 2019-11-01 中国人民解放军国防科技大学 Space debris clearing system and method and task planning method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yong Liu et al., "Multi-objective optimal preliminary planning of multi-debris active removal mission in LEO", Science China (Information Sciences), 2017, full text. *
Zuo Jialiang et al., "Intelligent maneuver decision-making in air combat based on heuristic reinforcement learning", Acta Aeronautica et Astronautica Sinica, 2017, full text. *

Also Published As

Publication number Publication date
CN110991712A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
EP4231197B1 (en) Training machine learning models on multiple machine learning tasks
Haaswijk et al. Deep learning for logic optimization algorithms
Su et al. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management
JP6827539B2 (en) Training action selection neural networks
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
KR20180090989A (en) Asynchronous deep reinforcement learning
Örkcü et al. Estimating the parameters of 3-p Weibull distribution using particle swarm optimization: A comprehensive experimental comparison
Lee et al. Monte-carlo tree search in continuous action spaces with value gradients
Takahashi et al. Convolutional neural network-based topology optimization (CNN-TO) by estimating sensitivity of compliance from material distribution
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN110991712B (en) Planning method and device for space debris removal task
Geng et al. Scatter search based particle swarm optimization algorithm for earliness/tardiness flowshop scheduling with uncertainty
CN110852435A (en) Neural evolution calculation model
CN110674470B (en) Distributed task planning method for multiple robots in dynamic environment
CN115576278B (en) Multi-agent multi-task layered continuous control method based on temporal equilibrium analysis
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
US20220391687A1 (en) Reinforcement learning algorithm search
CN115345303A (en) Convolutional neural network weight tuning method, device, storage medium and electronic equipment
CN113850423A (en) Shortest path planning method based on improved ant colony algorithm
Ruijl et al. HEPGAME and the Simplification of Expressions
Cheng et al. Robust Actor-Critic With Relative Entropy Regulating Actor
CN112827174A (en) Distributed multi-robot target searching method
Zhan et al. Dueling network architecture for multi-agent deep deterministic policy gradient
Richter et al. Learning policies through quantile regression
CN116892866B (en) Rocket sublevel recovery track planning method, rocket sublevel recovery track planning equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant