CN116107213A - Spacecraft pursuit task combination optimization control method based on SAC and LGVF - Google Patents

Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Info

Publication number
CN116107213A
Authority
CN
China
Prior art keywords
spacecraft
pursuit
threat
task
model
Prior art date
Legal status
Pending
Application number
CN202310159415.2A
Other languages
Chinese (zh)
Inventor
周林
程聪聪
冷俊芳
张梦
丁鑫龙
魏倩
彭青蓝
姚鸿泰
晏加元
邱倩
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202310159415.2A
Publication of CN116107213A

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B 13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems


Abstract

The invention provides a spacecraft pursuit task combination optimization control method based on SAC and LGVF. First, a hierarchical control method divides the tasks in the pursuit scene into different stages and establishes a hierarchical simplified model of the spacecraft pursuit task. Second, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, giving the pursuing spacecraft the ability to handle dynamic uncertain states. Finally, the Lyapunov guidance vector field (LGVF) method is introduced within the framework of the improved SAC algorithm to form a combined control method that compresses the solution space and thereby optimizes the solving process in a very large solution space. The method enables the pursuing spacecraft to complete the pursuit task autonomously in a scene where external information is only partially observable and unpredictable, providing real-time autonomous control capability and improving the task success rate.

Description

Spacecraft pursuit task combination optimization control method based on SAC and LGVF
Technical Field
The invention relates to the technical field of autonomous spacecraft control, in particular to a combined optimal control method for a spacecraft pursuit task based on SAC and LGVF.
Background
The spacecraft pursuit-evasion problem is a research hotspot in the current field of air and space combat. With the improvement of performance, a spacecraft can not only carry out battlefield reconnaissance but also perform pursuit tasks, completing the conversion from a reconnaissance platform to a combat platform.
Technical research on the spacecraft pursuit task has been carried out abroad since the early 1970s; its core purpose is to control the spacecraft so that it tracks the target while avoiding threats and guaranteeing its own safety. Conventional methods for solving the combinatorial optimization problem include exact, approximate, and heuristic algorithms, many of which have proven to be reliable and stable. However, traditional methods rarely exploit the common characteristics shared among problem instances to obtain a general solution; a new solver must be built for each different instance of a similar problem, so these methods cannot be applied to dynamic combinatorial optimization problems in which the scene changes from moment to moment.
Disclosure of Invention
The invention aims to provide a spacecraft pursuit task combination optimization control method based on SAC and LGVF, which solves two problems of the spacecraft pursuit task in a dynamic unknown environment: first, the unpredictability of external information, such as the target's escape mode, threat positions, and fire ranges; and second, the partial observability of external information, since only part of the environment state can be obtained through the spacecraft's sensors.
The invention adopts the technical scheme that:
a spacecraft pursuit task combination optimization control method based on SAC and LGVF specifically comprises the following steps:
step 1: the method comprises the following specific steps of establishing a spacecraft pursuit task scene model:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P capturing the escaping spacecraft E in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat, respectively, where T_i denotes the i-th threat;
the dynamic differential model of the pursuing spacecraft and the escaping spacecraft is built for the scene as shown in formula (2):

dx_i/dt = v_i·cos ψ_i,  dy_i/dt = v_i·sin ψ_i,  dv_i/dt = a_i,  dψ_i/dt = ω_i   (2)

where x_i, y_i are the current position of the spacecraft and dx_i/dt, dy_i/dt are their derivatives, i.e. the velocity components along the two coordinate directions; v_i denotes the speed of the spacecraft and a_i its acceleration, dv_i/dt being the derivative of the speed v_i and equal to the acceleration a_i; ψ_i denotes the heading angle of the spacecraft and ω_i its angular velocity, dψ_i/dt being the derivative of the heading angle ψ_i, i.e. the heading-angle rate of change, and equal to the angular velocity ω_i. The angular velocity of the pursuing spacecraft is determined by the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
The condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
step 2: according to the spacecraft pursuit task scene model established in the step 1, respectively designing a state space model, an action space model and a state transition model of the pursuit spacecraft and the escape spacecraft;
step 3: according to the established spacecraft pursuit task scene model in the step 1, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask;
Step 4: according to the layering simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning flexible actor critique algorithm is provided to establish an autonomous motion planning control framework, so that the capability of processing a dynamic uncertain state is provided for the pursuit spacecraft, and the current dynamic pursuit scene needing real-time optimization is met;
step 5: according to the layering simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of an improved SAC algorithm, a combination method is formed by introducing a Liapunov guide vector field method, and a SAC algorithm learning process is optimized to form a combination optimization method;
step 6: applying the combined optimization method in the step 5 to the established spacecraft pursuit task layering simplified model in the step 4, and training an autonomous motion planning model of the pursuit spacecraft;
step 7: loading the autonomous motion planning model of the pursuing spacecraft trained in step 6 into an online pursuit task simulation scene in which information is only partially observable and unpredictable for testing, and perfecting the combined optimization method through feedback of the test results of the pursuing spacecraft executing the pursuit task.
The step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
Setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
the method uses relative information relations to establish the State space model State, which compresses the measurement space, reduces the input processing load of the neural network to improve algorithm performance, and lets the algorithm focus on learning a solving scheme; the expression of the state space model is designed as shown in formula (7):

State = [d_PE, d_PTi, α_PE, α_PTi]   (7)

where d_PE is the distance between the pursuing spacecraft and the escaping spacecraft, d_PTi is the distance between the pursuing spacecraft and each threat, α_PE is the angle between the velocity direction of the pursuing spacecraft P and the target line of sight LOS to the escaping spacecraft E, and α_PTi is the angle between the velocity direction of the pursuing spacecraft P and the LOS to each threat T_i; the LOS refers to the vector direction pointing from the position of the pursuing spacecraft to the target;
step 2.2: design of action space models of a pursuing spacecraft and an escape spacecraft, and specifically:
the control inputs of the pursuing spacecraft are designed to be angular velocity and acceleration; the dynamics equation of the spacecraft sets the spacecraft to uniform motion, and the Action space Action is shown in formula (8):

Action = [ω]   (8)

The maximum angular velocity of the spacecraft is set to be less than 25.5 rad/sec, i.e. ω ∈ [-25.5, 25.5], with the counterclockwise direction in the top view taken as positive;
step 2.3: design of state transition models of a pursuing spacecraft and an escape spacecraft, and specifically:
the spacecraft motion state transition equation is shown in formula (9):

s_{i,t+1} = s_{i,t} + Δs_{i,t}   (9)

where i refers to the pursuing spacecraft or the escaping spacecraft; the spacecraft in the current state s_t takes action A_t, obtains the amount of state change through interaction with the scene, and adds it to the current state s_t to obtain the next state s_{t+1}.
Step 3, according to the spacecraft pursuit task scene model established in step 1, establishing a layering simplified model for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask, which specifically comprises the following steps:
firstly, a first-stage task refers to that when the pursuit spacecraft does not receive threat signals from measurement information of the environment, the pursuit spacecraft is required to continuously move towards the escape spacecraft under the driving of a designed autonomous motion planning model;
secondly, the second-stage task refers to the case in which the pursuing spacecraft receives threat information from the measurement information of the environment, i.e. a threat appears within the LOS from the pursuing spacecraft to the target; the current task of the pursuing spacecraft is then to execute a fly-around evasion maneuver around the threat. At this time the pursuing spacecraft should fly into the flight track set by the LGVF method and then execute the fly-around action along that track; finally, when the pursuit target, the pursuing spacecraft and the threat form an obtuse angle, i.e. the pursuing spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escaping spacecraft.
The step 4 specifically comprises the following steps:
the autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature allows the pursuing spacecraft to learn the common characteristics of the problem through training, while the offline-trained model can be used directly for online test application, providing the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization; adapting the entropy coefficient to the different subtasks of the pursuing spacecraft improves the learning efficiency of the SAC algorithm and the tracking accuracy of the pursuing spacecraft;
for a general deep reinforcement learning algorithm, the learning goal is to learn a policy that maximizes the expected cumulative reward obtained by the pursuing spacecraft interacting with the environment, in the form of formula (10):

π* = argmax_π E[ Σ_t R(s_t, a_t) ]   (10)

where R(s_t, a_t) denotes the return obtained by taking action a_t in state s_t, and the goal of the policy π* is to maximize the expected total return; the SAC algorithm belongs to the maximum-entropy reinforcement learning family, i.e. in addition to the basic objective above, the entropy of the actions output by the policy is also required to be maximal, in the form of formula (11):

π* = argmax_π E[ Σ_t R(s_t, a_t) + α·H(π(·|s_t)) ]   (11)

where α is the entropy coefficient and H(π(·|s_t)) is the policy entropy; requiring maximum entropy mainly serves to randomize the policy, i.e. to keep the probabilities of the output actions as balanced as possible, so that the pursuing spacecraft explores all possible optimal paths.
The specific process of the step 5 is as follows:
aiming at the situation that the threat randomly appears in the LOS view angle in a complex scene, the pursuit spacecraft is required to carry out evasion maneuver decision in real time: at this time, a trajectory controller designed according to the Lyapunov guidance vector field algorithm is used for threat T i Establishing a vector field model, and designing an evasion track of the pursuit spacecraft to guide the pursuit spacecraft to stably fly around the threat;
the design return function is in the form of formula (12):
r=r a +r b +r c (12)
wherein r represents the total return value, r a The sparse return function is represented, specifically, the return given by the pursuit spacecraft when successfully capturing the target and the return given by the pursuit spacecraft when intercepted are shown in the form of a formula (13):
Figure SMS_16
r b the method is one of designed guiding type rewarding return components, the included angle between the speed direction of each iterative step pursuing spacecraft and escaping spacecraft and the target sight LOS is calculated in the whole process, and the weight mu is used for calculating the target sight LOS 1 The dimension is controlled and then added to the total return value, and the form is shown as a formula (14):
r b =-μ 1 α PE (14)
r c The designed guiding type rewarding return component is two, the process of the pursuing spacecraft flying to the escape spacecraft is divided into two stages, and the form is shown as a formula (15):
Figure SMS_17
the stage 1 refers to that the pursuing spacecraft does not sense the existence of threat, namely, when the pursuing spacecraft is in a first-stage task, the pursuing spacecraft continuously moves towards the escaping spacecraft under the driving of a designed autonomous motion planning model, and r is as follows c The guiding function is not exerted; stage 2 refers to when a threat occurs in a path where a vector of the pursuit spacecraft points to the escape spacecraft, that is, the pursuit spacecraft is in a second stage task, a vector field model is built for the threat T according to a lyapunov guidance vector field algorithm, and the radius of the vector field is designed to be d PT +w P, wherein wP The method refers to the pursuit of the fuselage width of the spacecraft, and threat level is expressed in an exponential function form; at this time r c Guiding the distance between the control of the pursuit spacecraft and the threat, otherwise, generating a negative return value through the weight mu 2 And controlling the report value dimension and adding the result to the total report value as punishment.
Step 6, the combined optimization method is applied to an established spacecraft pursuit task scene, an autonomous motion planning model of the pursuit spacecraft is trained, and the specific process comprises the following steps:
Step 6.1: loading a spacecraft pursuit task into a deep reinforcement learning algorithm, and specifically:
in the established spacecraft pursuit task scene, the pursuing spacecraft in the current state s_t takes action a_t and transitions to the next state s_{t+1}; this is regarded as a Markov process with actions and rewards introduced, as in reinforcement learning. According to the Markov process model five-tuple ⟨S, A, T, p, γ⟩, S is the state space State of the process, A is the action set Action, T is the time sequence set, p is the state transition probability function, and γ is the return function for the state steps;
at each decision time t, a state transition probability matrix p_t is obtained over the finite action space, as shown in formula (16):

p_t = [ p(s'|s, a_1) ... p(s'|s, a_N) ]   (16)

where s is the state of the pursuing spacecraft at the current time, after executing an action from {a_1, ..., a_N} it enters a new state s'; a_i is the i-th action in the action space, N is the total number of actions in the action space, and p(s'|s, a_i) is the probability that the pursuing spacecraft reaches the new state s' after executing action a_i; the return matrix R_i(s, a) produced by the interaction of the pursuing spacecraft with the environment, and the total return function R(s, a) formed from it, are shown in formulas (17) and (18):

R_i(s, a) = γ(s_i, a_i)   (17)

R(s, a) = Σ_t γ(s_t, a_t)   (18)
the total return function is combined with the maximum-entropy strategy of step 4 to obtain the SAC algorithm optimization objective shown in formula (11);
Step 6.2: training an autonomous motion planning model of the pursuit spacecraft, and specifically:
by adjusting parameters of the network, an autonomous motion planning model is perfected, and actions of the pursuing spacecraft in different states are guided; the training time of the model is shortened and the training efficiency of the model is maximized by adjusting the learning rate; guiding the optimizing process of the pursuit spacecraft in a huge action space by adjusting the weight of each return in the return function; the learning efficiency of the algorithm is improved and tracking precision of the pursuit spacecraft is improved by adjusting the entropy coefficient; through the adjustment, the total return value of the algorithm tends to be converged to a constant steadily along with the increase of training times.
The step 7 specifically comprises the following steps:
firstly, adjusting parameters of a SAC algorithm network model and an autonomous motion planning framework structure according to feedback results, adapting to a spacecraft pursuit task scene, and perfecting the autonomous motion planning framework in the step 4;
secondly, adjusting the LGVF model parameters in the step 5 according to the feedback result, so that the pursuit spacecraft designed by the vector field has more stable avoidance track to the threat and is more suitable for the spacecraft pursuit task scene;
finally, parameters of the autonomous motion planning model in the step 6.2 are adjusted according to feedback results, and model training efficiency and tracking accuracy of the pursuit spacecraft are improved.
The combination optimization method is perfected by the feedback optimization means of the algorithm.
The beneficial effects of the invention are as follows:
through the above technical scheme, the invention provides a spacecraft pursuit task combination optimization control method based on SAC and LGVF, belonging to the technical field of autonomous spacecraft control. Aiming at the multi-interceptor pursuit game problem, the invention proposes a combined optimization method: according to the established spacecraft pursuit task scene, the pursuit task is divided into multiple levels of subtasks by a hierarchical control method; an improved SAC algorithm is proposed and a spacecraft autonomous motion planning control architecture is established; and the Lyapunov guidance vector field method is introduced to design the trajectory controller of the pursuing spacecraft. The invention combines the optimization capability of traditional model-based algorithms, the powerful perception capability of deep learning, and the heuristic learning capability of reinforcement learning into a new combined optimization algorithm, which represents new progress in the technical field of autonomous spacecraft control and can be applied to scenes such as a spacecraft autonomously avoiding multiple interceptors while executing a tracking task. The method places small demands on the spacecraft's computing resources, provides real-time autonomous control capability, and improves the task success rate.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a two-dimensional planar geometric model diagram of the present invention;
FIG. 3 is a model training convergence diagram of a combined optimization algorithm LGVF-SAC proposed by the present invention;
FIG. 4 is a model training convergence diagram of a comparative algorithm Original SAC of the combined optimization algorithm provided by the invention;
Detailed Description
As shown in fig. 1 to fig. 4, the spacecraft pursuit task combination optimization control method based on SAC and LGVF according to the present embodiment is specifically implemented by the following steps:
step 1: establishing a scene model of the pursuit game problem, with the following specific process:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P (Pursuit) capturing the escaping spacecraft E (Escape) in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; and f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat T (Threat), respectively.
The dynamic differential model of the pursuit spacecraft and the escape spacecraft is built for the scene as shown in a formula (2):
Figure SMS_20
wherein ,xi ,y i As the current position information of the spacecraft,
Figure SMS_21
respectively x i ,y i I.e. the component of the velocity in the direction of the two vectors; v i Representing the speed of a spacecraft, a i Representing acceleration of spacecraft, ++>
Figure SMS_22
Is the velocity v i Is equal to the acceleration a i ;ψ i Representing heading angle omega of spacecraft i Representing the angular velocity of a spacecraft,/->
Figure SMS_23
Is heading angle psi i The differential quantity of (a) i.e. the course angle change rate, is equal to the angular velocity omega i . The angular velocity value of the pursuing spacecraft depends on the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
the condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
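As an illustrative sketch of the above scene model, the planar kinematics of formula (2) together with the capture and interception checks of formulas (4) and (5) can be simulated in Python as follows; the time step, the class layout and the numerical values are assumptions used only for illustration.

import math
from dataclasses import dataclass

@dataclass
class Craft:
    x: float      # position x_i
    y: float      # position y_i
    v: float      # speed v_i
    psi: float    # heading angle psi_i
    R: float      # pursuit / fire action range R_i

def step(c: Craft, omega: float, a: float = 0.0, dt: float = 0.1) -> None:
    """Euler integration of formula (2): dx/dt = v*cos(psi), dy/dt = v*sin(psi), dv/dt = a, dpsi/dt = omega."""
    c.x += c.v * math.cos(c.psi) * dt
    c.y += c.v * math.sin(c.psi) * dt
    c.v += a * dt                 # a = 0 under the uniform-motion assumption of step 2.2
    c.psi += omega * dt

def dist(p: Craft, q: Craft) -> float:
    return math.hypot(p.x - q.x, p.y - q.y)

def task_status(P: Craft, E: Craft, threats: list) -> str:
    """Formula (4): success if d_PE <= R_P; formula (5): failure if d_PTi <= L = R_P + R_Ti."""
    if dist(P, E) <= P.R:
        return "success"
    if any(dist(P, T) <= P.R + T.R for T in threats):
        return "failure"
    return "running"

# assumed example values
P = Craft(0.0, 0.0, 1.0, 0.0, 0.5)
E = Craft(10.0, 5.0, 0.8, math.pi, 0.0)
threats = [Craft(5.0, 2.0, 0.0, 0.0, 1.0)]
step(P, omega=0.1)
print(task_status(P, E, threats))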
step 2: according to the scene model established in step 1, respectively designing the state space model of the pursuing spacecraft, the escaping spacecraft and the threats, the action space model of the pursuing spacecraft and the escaping spacecraft, and the state transition model of the pursuing spacecraft and the escaping spacecraft; step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
the method uses relative information relations to establish the State space model State, which compresses the measurement space, reduces the input processing load of the neural network to improve algorithm performance, and lets the algorithm focus on learning a solving scheme; the expression of the state space model is designed as shown in formula (7):

State = [d_PE, d_PTi, α_PE, α_PTi]   (7)

where d_PE is the distance between the pursuing spacecraft and the escaping spacecraft, d_PTi is the distance between the pursuing spacecraft and each threat, α_PE is the angle between the velocity direction of the pursuing spacecraft P and the target line of sight LOS to the escaping spacecraft E, and α_PTi is the angle between the velocity direction of the pursuing spacecraft P and the LOS to each threat T_i; the LOS is the vector direction pointing from the position of the pursuing spacecraft to the target.
Step 2.2: design of action space models of a pursuing spacecraft and an escape spacecraft, and specifically:
the control inputs of the pursuing spacecraft are designed to be angular velocity and acceleration; in order to focus the algorithm on learning a solving scheme, the invention assumes in the spacecraft dynamics equation that the spacecraft moves at uniform speed, and the Action space Action is shown in formula (8):

Action = [ω]   (8)

The action space is designed using the angular velocity of a typical spacecraft: the maximum angular velocity is set to be less than 25.5 rad/sec, i.e. ω ∈ [-25.5, 25.5], with the counterclockwise direction in the top view taken as positive.
Step 2.3: design of state transition models of a pursuing spacecraft and an escape spacecraft, and specifically:
the spacecraft motion state transition equation is shown in formula (9):

s_{i,t+1} = s_{i,t} + Δs_{i,t}   (9)

where i refers to the pursuing spacecraft or the escaping spacecraft. The spacecraft in the current state s_t takes action A_t, obtains the amount of state change through interaction with the scene, and adds it to the current state s_t to obtain the next state s_{t+1}.
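A brief Python sketch of how the relative state vector of formula (7) and the additive state transition of formula (9) could be computed is given below; the angle-wrapping convention and the helper names are illustrative assumptions rather than part of the method description.

import math

def los_angle(x, y, psi, tx, ty):
    """Angle between the craft's velocity direction (heading psi) and the LOS vector to the target."""
    los = math.atan2(ty - y, tx - x)
    diff = math.atan2(math.sin(los - psi), math.cos(los - psi))   # wrap to [-pi, pi]
    return abs(diff)

def build_state(P, E, threats):
    """Formula (7): State = [d_PE, d_PTi..., alpha_PE, alpha_PTi...] from relative quantities only."""
    px, py, ppsi = P
    d_PE = math.hypot(E[0] - px, E[1] - py)
    d_PT = [math.hypot(t[0] - px, t[1] - py) for t in threats]
    a_PE = los_angle(px, py, ppsi, E[0], E[1])
    a_PT = [los_angle(px, py, ppsi, t[0], t[1]) for t in threats]
    return [d_PE, *d_PT, a_PE, *a_PT]

def transition(s_t, delta_s):
    """Formula (9): the next state is the current state plus the change observed from the scene."""
    return [s + d for s, d in zip(s_t, delta_s)]

# assumed example: pursuer at the origin heading east, escaper and one threat ahead
state = build_state((0.0, 0.0, 0.0), (10.0, 5.0), [(5.0, 2.0)])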
Step 3, according to the established spacecraft pursuit task scene, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-layer subtask, and the method specifically comprises the following steps:
firstly, the first-stage task refers to that when the pursuing spacecraft does not receive threat signals from measurement information of the environment, the pursuing spacecraft is required to continuously move towards the escaping spacecraft under the driving of a designed autonomous motion planning model.
Secondly, the second-stage task refers to the case in which the pursuing spacecraft receives threat information from the measurement information of the environment, i.e. a threat appears within the LOS from the pursuing spacecraft to the target; the current task of the pursuing spacecraft is then to execute a fly-around evasion maneuver around the threat. At this time the pursuing spacecraft should fly into the flight track set by the LGVF method and then execute the fly-around action along that track; finally, when the pursuit target, the pursuing spacecraft and the threat form an obtuse angle, i.e. the pursuing spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escaping spacecraft.
According to the measurement information of the pursuit spacecraft on the current environment, a layering simplified model is built for the pursuit spacecraft task through a layering control method, so that the pursuit spacecraft task is simplified into the two-stage task.
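The two-stage switching logic of the hierarchical simplified model can be summarized by the sketch below; the geometric test for a threat appearing in the LOS (a circle-segment intersection) and the obtuse-angle cut-out test are assumed formulations consistent with the description above.

import math

def threat_in_los(P, E, T, R_T):
    """Does the threat circle of radius R_T intersect the LOS segment from the pursuer P to the target E?"""
    ex, ey = E[0] - P[0], E[1] - P[1]
    tx, ty = T[0] - P[0], T[1] - P[1]
    seg2 = ex * ex + ey * ey
    u = 0.0 if seg2 == 0 else max(0.0, min(1.0, (tx * ex + ty * ey) / seg2))
    return math.hypot(tx - u * ex, ty - u * ey) <= R_T

def angle_EPT_is_obtuse(P, E, T):
    """Cut-out condition: the target-pursuer-threat angle is obtuse, i.e. the threat has been flown around."""
    v1 = (E[0] - P[0], E[1] - P[1])
    v2 = (T[0] - P[0], T[1] - P[1])
    return v1[0] * v2[0] + v1[1] * v2[1] < 0.0

def current_stage(P, E, threats, R_T):
    """Stage 1: fly toward the escaping spacecraft; stage 2: circumnavigate a threat detected on the LOS."""
    for T in threats:
        if threat_in_los(P, E, T, R_T) and not angle_EPT_is_obtuse(P, E, T):
            return 2
    return 1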
Step 4: according to the hierarchical simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, providing the pursuing spacecraft with the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization, specifically:
the autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature allows the pursuing spacecraft to learn the common characteristics of the problem through training, while the offline-trained model can be used directly for online test application, providing the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization. Adapting the entropy coefficient to the different subtasks of the pursuing spacecraft improves the learning efficiency of the SAC algorithm and the tracking accuracy of the pursuing spacecraft.
For a general deep reinforcement learning algorithm, the learning goal is to learn a policy model that maximizes the expected cumulative reward obtained by the pursuing spacecraft interacting with the environment, in the form of formula (10):

π* = argmax_π E[ Σ_t R(s_t, a_t) ]   (10)

where R(s_t, a_t) denotes the return obtained by taking action a_t in state s_t, and the goal of the policy π* is to maximize the expected total return. The SAC algorithm belongs to the maximum-entropy reinforcement learning family, i.e. in addition to the basic objective above, the entropy of the actions output by the policy is also required to be maximal, in the form of formula (11):

π* = argmax_π E[ Σ_t R(s_t, a_t) + α·H(π(·|s_t)) ]   (11)

where α is the entropy coefficient and H(π(·|s_t)) is the policy entropy. Requiring maximum entropy mainly serves to randomize the policy, i.e. to keep the probabilities of the output actions as balanced as possible, which means the pursuing spacecraft needs to explore all possible optimal paths. First, the pursuing spacecraft learns not just one but as many ways of completing the task as possible, so the learned policy can adapt to more complex specific tasks; second, the method is more robust, because the pursuing spacecraft explores various optimal paths from different actions and can therefore adjust more easily when facing interference.
Utilizing the characteristics of the entropy term, the method adaptively adjusts the entropy coefficient according to the task stage the pursuing spacecraft is in. First, when the pursuing spacecraft receives no threat signal from the measurement information of the environment, it should keep flying toward the target, and the maximum-entropy strategy should encourage it to actively explore the optimal trajectory; at this stage the automatically tuned maximum-entropy coefficient of the SAC algorithm is retained. Second, when the pursuing spacecraft receives threat information from the measurement information of the environment, it should fly around along the flight track preset by the trajectory controller; at this stage the entropy coefficient is attenuated by multiplying it by a constant, which limits the exploratory behavior of the pursuing spacecraft so that it can fly stably along the preset track.
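The maximum-entropy objective of formula (11) and the stage-dependent handling of the entropy coefficient described above can be sketched as follows; the attenuation constant, the learning rate and the target entropy are assumed values following common SAC practice rather than figures given in this description.

import torch

log_alpha = torch.zeros(1, requires_grad=True)      # automatically tuned entropy coefficient of SAC
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
TARGET_ENTROPY = -1.0                                # one continuous action: the angular velocity omega
STAGE2_DECAY = 0.1                                   # assumed attenuation constant for the second-stage task

def effective_alpha(stage: int) -> torch.Tensor:
    """Stage 1 keeps the auto-tuned alpha; stage 2 attenuates it so the craft follows the LGVF track."""
    alpha = log_alpha.exp()
    return alpha * STAGE2_DECAY if stage == 2 else alpha

def alpha_loss(log_prob: torch.Tensor) -> torch.Tensor:
    """Temperature loss that drives the policy entropy toward TARGET_ENTROPY (standard SAC update)."""
    return -(log_alpha * (log_prob + TARGET_ENTROPY).detach()).mean()

def policy_loss(q_value: torch.Tensor, log_prob: torch.Tensor, stage: int) -> torch.Tensor:
    """Formula (11): maximize E[Q(s,a) + alpha*H(pi(.|s))], i.e. minimize alpha*log_prob - Q."""
    return (effective_alpha(stage).detach() * log_prob - q_value).mean()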
Step 5: according to the hierarchical simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of the improved SAC algorithm, the Lyapunov guidance vector field method is introduced to form a combined method and optimize the SAC algorithm learning process, forming the combined optimization method. The specific process is as follows:
aiming at the situation in a complex scene where a threat appears randomly within the LOS view angle, the pursuing spacecraft must make evasion maneuver decisions in real time. In this case, a trajectory controller designed according to the Lyapunov guidance vector field algorithm establishes a vector field model for the threat T_i and designs an evasion trajectory that guides the pursuing spacecraft to fly stably around the threat.
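The description does not reproduce the explicit LGVF equations, so the sketch below uses a commonly published form of the Lyapunov guidance vector field for a standoff circle around the threat; it should be read as an assumed reference implementation of the trajectory controller, with the standoff radius set, as stated later in this step, to the threat distance plus the pursuer's fuselage width.

import math

def lgvf_velocity(px, py, tx, ty, r_d, v_d):
    """Desired velocity of a Lyapunov guidance vector field converging to a circle of radius r_d
    around the threat at (tx, ty); this is the commonly used standoff-tracking form (assumed here)."""
    x, y = px - tx, py - ty
    r = math.hypot(x, y)
    if r < 1e-6:
        return 0.0, 0.0
    c = -v_d / (r * (r * r + r_d * r_d))
    ux = c * (x * (r * r - r_d * r_d) + y * (2.0 * r * r_d))
    uy = c * (y * (r * r - r_d * r_d) - x * (2.0 * r * r_d))
    return ux, uy

def lgvf_turn_rate(px, py, psi, tx, ty, r_d, v_d, k=1.0, omega_max=25.5):
    """Proportional turn-rate command (assumed law) steering the heading toward the LGVF direction,
    clipped to the angular-velocity limit of formula (8)."""
    ux, uy = lgvf_velocity(px, py, tx, ty, r_d, v_d)
    err = math.atan2(uy, ux) - psi
    err = math.atan2(math.sin(err), math.cos(err))   # wrap to [-pi, pi]
    return max(-omega_max, min(omega_max, k * err))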
The return function is designed in the form of formula (12):

r = r_a + r_b + r_c   (12)

where r denotes the total return value and r_a denotes the sparse return component, namely the return given when the pursuing spacecraft successfully captures the target and the (negative) return given when the pursuing spacecraft is intercepted, in the form of formula (13):

r_a = { positive capture return, if d_PE ≤ R_P;  negative interception return, if d_PTi ≤ L;  0, otherwise }   (13)

r_b is the first designed guiding reward component: throughout the process, at every iteration step the angle between the velocity direction of the pursuing spacecraft and the target line of sight LOS to the escaping spacecraft is computed, its magnitude is controlled by the weight μ_1, and it is added to the total return, in the form of formula (14):

r_b = -μ_1·α_PE   (14)

r_c is the second designed guiding reward component; the process of the pursuing spacecraft flying toward the escaping spacecraft is divided into two stages, in the form of formula (15):

r_c = { 0, in stage 1;  -μ_2·(exponential threat-distance penalty), in stage 2 }   (15)

Stage 1 means the pursuing spacecraft does not perceive the existence of a threat, i.e. it is in the first-stage task and keeps moving toward the escaping spacecraft under the designed autonomous motion planning model, so r_c exerts no guiding effect; stage 2 means a threat appears on the path of the vector pointing from the pursuing spacecraft to the escaping spacecraft, i.e. the pursuing spacecraft is in the second-stage task: a vector field model is built for the threat T according to the Lyapunov guidance vector field algorithm, the radius of the vector field is designed as d_PT + w_P, where w_P is the fuselage width of the pursuing spacecraft, and the threat level is expressed in the form of an exponential function. In this case r_c guides the pursuing spacecraft to control its distance to the threat, and otherwise the generated negative return value, with its magnitude controlled by the weight μ_2, is added to the total return as a punishment.
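A compact sketch of the composite return of formulas (12) to (15) is given below; the sparse reward magnitudes, the exact exponential shape of the stage-2 penalty and the weights μ_1 and μ_2 are placeholders, since the description defines their roles but not their numerical values.

import math

MU1, MU2 = 0.01, 0.1                      # placeholder shaping weights
R_CAPTURE, R_INTERCEPT = 10.0, -10.0      # placeholder sparse returns

def reward(d_PE, R_P, d_PT, L, alpha_PE, stage, r_field):
    """Total return r = r_a + r_b + r_c as in formula (12)."""
    # r_a, formula (13): sparse return on capture or interception
    if d_PE <= R_P:
        r_a = R_CAPTURE
    elif min(d_PT) <= L:
        r_a = R_INTERCEPT
    else:
        r_a = 0.0
    # r_b, formula (14): penalize the angle between the pursuer's velocity and the LOS to the target
    r_b = -MU1 * alpha_PE
    # r_c, formula (15): zero in stage 1; an exponential threat-distance penalty in stage 2 (assumed shape)
    r_c = 0.0 if stage == 1 else -MU2 * math.exp(r_field - min(d_PT))
    return r_a + r_b + r_c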
Step 6, applying the combined optimization method to an established spacecraft pursuit task scene to train an autonomous motion planning model of the pursuit spacecraft, wherein the specific process comprises the following steps:
Step 6.1: loading a spacecraft pursuit task into a deep reinforcement learning algorithm, and specifically:
in the established spacecraft pursuit task scene, the pursuing spacecraft in the current state s_t takes action a_t and transitions to the next state s_{t+1}; this is regarded as a Markov process with actions and rewards introduced, as in reinforcement learning. According to the Markov process model five-tuple ⟨S, A, T, p, γ⟩, S is the state space State of the process, A is the action set Action, T is the time sequence set, p is the state transition probability function, and γ is the return function for the state steps.
At each decision time t, a state transition probability matrix p_t is obtained over the finite action space, as shown in formula (16):

p_t = [ p(s'|s, a_1) ... p(s'|s, a_N) ]   (16)

where s is the state of the pursuing spacecraft at the current time, after executing an action from {a_1, ..., a_N} it enters a new state s'; a_i is the i-th action in the action space, N is the total number of actions in the action space, and p(s'|s, a_i) is the probability that the pursuing spacecraft reaches the new state s' after executing action a_i. The return matrix R_i(s, a) produced by the interaction of the pursuing spacecraft with the environment, and the total return function R(s, a) formed from it, are shown in formulas (17) and (18):

R_i(s, a) = γ(s_i, a_i)   (17)

R(s, a) = Σ_t γ(s_t, a_t)   (18)
the total return function is combined with the maximum-entropy strategy of step 4 to obtain the SAC algorithm optimization objective as shown in formula (11) above.
Step 6.2: training an autonomous motion planning model of the pursuit spacecraft, and specifically:
by adjusting parameters of the network, an autonomous motion planning model is perfected, and actions of the pursuing spacecraft in different states are guided; the training time of the model is shortened and the training efficiency of the model is maximized by adjusting the learning rate; guiding the optimizing process of the pursuit spacecraft in a huge action space by adjusting the weight of each return in the return function; by adjusting the entropy coefficient, the algorithm learning efficiency and tracking accuracy of the pursuit spacecraft are improved. Through the adjustment, the total return value of the algorithm tends to be converged to a constant steadily along with the increase of training times.
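The training procedure of step 6.2 could be organized as in the abbreviated loop below; the environment and agent interfaces and the hyperparameter values are assumptions, shown only to indicate where the tuning knobs mentioned above (learning rate, return weights, entropy coefficient) enter the process.

import random
from collections import deque

def train(env, agent, total_steps=90_000, batch_size=256, warmup=1_000):
    """Off-policy SAC-style training: collect transitions into a replay buffer, then update the agent.

    Assumed interfaces: env.reset() -> state, env.step(action) -> (next_state, reward, done);
    agent.act(state) -> action, agent.update(batch) performs one gradient step on actor, critics and alpha."""
    buffer = deque(maxlen=100_000)
    state = env.reset()
    for _ in range(total_steps):
        action = agent.act(state)
        next_state, r, done = env.step(action)            # r = r_a + r_b + r_c from formula (12)
        buffer.append((state, action, r, next_state, done))
        state = env.reset() if done else next_state
        if len(buffer) >= warmup:
            agent.update(random.sample(buffer, batch_size))
    return agent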
And 7, loading the autonomous motion planning model of the pursuit spacecraft trained in the step 6 into an online pursuit task scene with observable and unpredictable information parts, and perfecting the combined optimization method through test effect feedback of the pursuit spacecraft execution pursuit task. Firstly, adjusting parameters of a SAC algorithm network model and an autonomous motion planning framework structure according to feedback results, adapting to a spacecraft pursuit task scene, and perfecting the autonomous motion planning framework; secondly, adjusting the parameters of the LGVF model according to the feedback result, so that the pursuit spacecraft designed by the vector field has more stable avoidance track to the threat and is more suitable for the spacecraft pursuit task scene; finally, algorithm parameters are adjusted according to the feedback result and the model training scheme in the step 6.2, and the model training process is executed again. The combination optimization method is perfected by the feedback optimization means of the algorithm.
With the development of artificial intelligence technology in recent years, deep learning has broken through the barriers of traditional methods in many fields and made remarkable progress. As an important branch of deep learning, deep reinforcement learning is mainly used for sequential decision making, i.e. selecting an action according to the current environment state and continuously adjusting the model according to the feedback of that action, so as to reach the set goal. The combinatorial optimization problem makes optimal selections in a discrete decision space, which resembles the natural 'action selection' characteristic of reinforcement learning, and the 'offline training, online decision' characteristic of deep reinforcement learning makes online real-time solution of the combinatorial optimization problem possible; the combinatorial optimization problem is therefore well suited to being solved with deep reinforcement learning methods.
Considering complexities such as the unpredictability and partial observability of external information in real application scenes, such a scene is difficult to solve by directly establishing a mathematical model, while the optimality of solutions output directly by a learned model-free method is difficult to guarantee. To ensure the flexibility and reliability of the solving method, a better scheme is to solve the problem with a combined optimization method constructed through combined learning. First, a hierarchical control method divides the tasks in the pursuit scene into different stages and establishes a hierarchical simplified model; second, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, giving the pursuing spacecraft the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization; finally, the Lyapunov guidance vector field (LGVF) method is introduced under the framework of the improved SAC algorithm to form a combined method, compressing the solution space to optimize the solving process in a very large solution space.
Through repeated tests and verification in the simulation environment, the LGVF-SAC combined optimization algorithm proposed by the invention reaches a converged model state after 90000 steps of training in each initialized scene, whereas the original SAC algorithm is difficult to converge in scenes with the same layout and its convergence is degraded by premature overfitting during training. In addition, as shown in Table 1, when the two trained models are tested in the same simulation environment, the task execution success rate of the spacecraft controlled by the original SAC algorithm is low, while the task execution success rate of the spacecraft controlled by the LGVF-SAC combined optimization algorithm is significantly improved.
TABLE 1

Algorithm      | Success rate | Convergence speed
LGVF-SAC       | 87%          | 90000 steps
Original SAC   | 26%          | Difficult to converge; overfits
The invention reduces scene complexity by hierarchically simplifying the spacecraft pursuit-evasion scene, and improves algorithm learning capability and task execution capability by combining the improved SAC algorithm with LGVF, which not only gives the spacecraft autonomy and dynamic pursuit capability but also guarantees the control accuracy and task success rate of the spacecraft when executing tasks autonomously.

Claims (7)

1. The spacecraft pursuit task combination optimal control method based on SAC and LGVF is characterized by comprising the following steps:
Step 1: the method comprises the following specific steps of establishing a spacecraft pursuit task scene model:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P capturing the escaping spacecraft E in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat, respectively, where T_i denotes the i-th threat;
the dynamic differential model of the pursuing spacecraft and the escaping spacecraft is built for the scene as shown in formula (2):

dx_i/dt = v_i·cos ψ_i,  dy_i/dt = v_i·sin ψ_i,  dv_i/dt = a_i,  dψ_i/dt = ω_i   (2)

where x_i, y_i are the current position of the spacecraft and dx_i/dt, dy_i/dt are their derivatives, i.e. the velocity components along the two coordinate directions; v_i denotes the speed of the spacecraft and a_i its acceleration, dv_i/dt being the derivative of the speed v_i and equal to the acceleration a_i; ψ_i denotes the heading angle of the spacecraft and ω_i its angular velocity, dψ_i/dt being the derivative of the heading angle ψ_i, i.e. the heading-angle rate of change, and equal to the angular velocity ω_i. The angular velocity of the pursuing spacecraft is determined by the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
the condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
step 2: according to the spacecraft pursuit task scene model established in the step 1, respectively designing a state space model, an action space model and a state transition model of the pursuit spacecraft and the escape spacecraft;
step 3: according to the established spacecraft pursuit task scene model in the step 1, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask;
step 4: according to the hierarchical simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, providing the pursuing spacecraft with the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization;
step 5: according to the hierarchical simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of the improved SAC algorithm, the Lyapunov guidance vector field method is introduced to form a combined method, and the SAC algorithm learning process is optimized to form a combined optimization method;
step 6: applying the combined optimization method in the step 5 to the established spacecraft pursuit task layering simplified model in the step 4, and training an autonomous motion planning model of the pursuit spacecraft;
step 7: loading the autonomous motion planning model of the pursuing spacecraft trained in step 6 into an online pursuit task simulation scene in which information is only partially observable and unpredictable for testing, and perfecting the combined optimization method through feedback of the test results of the pursuing spacecraft executing the pursuit task.
2. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, characterized in that step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
The method uses the relative information relation to establish a State space model State, so that the measurement space size is compressed, the input processing pressure of the neural network is reduced to improve the algorithm performance, the algorithm is focused on learning a solving scheme, and the expression design of the State space model is shown as a formula (7):
$State = [d_{PE},\ d_{PT_i},\ \alpha_{PE},\ \alpha_{PT_i}]$ (7)

where $d_{PE}$ refers to the distance between the pursuit spacecraft and the escape spacecraft, $d_{PT_i}$ refers to the distance between the pursuit spacecraft and each threat, $\alpha_{PE}$ is the included angle between the velocity direction of the pursuit spacecraft P and the target line of sight LOS to the escape spacecraft E, and $\alpha_{PT_i}$ is the included angle between the velocity direction of the pursuit spacecraft P and the target line of sight LOS to each threat $T_i$; the LOS refers to the vector direction pointing from the position of the pursuit spacecraft to the target;
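To make the relative-information state of formula (7) concrete, the sketch below assembles the state vector from the absolute states of formula (6); the helper names, the angle-wrapping convention and the stacking order are illustrative assumptions.

```python
# Hedged sketch of building the relative-information state of formula (7).
import numpy as np

def los_angle(own_pos, own_heading, target_pos):
    """Angle between own velocity direction and the line of sight to the target."""
    los = target_pos - own_pos
    los_dir = np.arctan2(los[1], los[0])
    # wrap the difference into [-pi, pi] and take its magnitude
    return np.abs((own_heading - los_dir + np.pi) % (2 * np.pi) - np.pi)

def build_state(p_pos, p_heading, e_pos, threat_pos):
    d_PE = np.linalg.norm(e_pos - p_pos)
    d_PT = np.linalg.norm(threat_pos - p_pos, axis=1)
    alpha_PE = los_angle(p_pos, p_heading, e_pos)
    alpha_PT = np.array([los_angle(p_pos, p_heading, t) for t in threat_pos])
    return np.concatenate(([d_PE], d_PT, [alpha_PE], alpha_PT))
```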
step 2.2: designing the action space models of the pursuit spacecraft and the escape spacecraft, specifically:
The control inputs of the pursuit spacecraft are designed as the angular velocity and the acceleration; according to the dynamic equation of the spacecraft, the spacecraft is set to move at a constant speed, so the action space Action is as shown in formula (8):

$Action = [\omega]$ (8)

The maximum angular velocity of the spacecraft is set to be less than 25.5 rad/sec, namely $\omega \in [-25.5, 25.5]$, with the counterclockwise direction in the top view taken as the positive direction;
step 2.3: designing the state transition models of the pursuit spacecraft and the escape spacecraft, specifically:
The spacecraft motion state transition equation is shown in formula (9):

$s_{t+1}^i = s_t^i + \Delta s_t^i$ (9)

where i refers to the pursuit spacecraft and the escape spacecraft; the spacecraft, in the current state $s_t$, takes action $A_t$, obtains the state transition increment $\Delta s_t$ through interaction with the scene, and adds this increment to the current state $s_t$ to obtain the next state $s_{t+1}$.
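For illustration, a constant-speed kinematic update consistent with the Action = [ω] design of formula (8) could look as follows; the discretisation and the time step Δt are assumptions, since formula (9) appears only as an image in the original filing.

```python
# Illustrative constant-speed kinematic step; dt and the Euler discretisation
# are assumptions, not the patent's exact state-transition equation.
import numpy as np

def step_state(x, y, theta, v, omega, dt):
    """One state transition s_{t+1} = s_t + Delta s_t for a constant-speed vehicle."""
    x_next = x + v * np.cos(theta) * dt
    y_next = y + v * np.sin(theta) * dt
    theta_next = theta + omega * dt
    return x_next, y_next, theta_next
```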
3. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, characterized in that in step 3, according to the spacecraft pursuit task scene model established in step 1, a hierarchical simplified model is built for the spacecraft pursuit task through a hierarchical control method, so that the spacecraft pursuit task is simplified into multi-level subtasks, specifically as follows:
Firstly, the first-stage task refers to the case where the pursuit spacecraft receives no threat signal from its measurement of the environment; the pursuit spacecraft is then required to move continuously toward the escape spacecraft under the drive of the designed autonomous motion planning model;
Secondly, the second-stage task refers to the case where the pursuit spacecraft receives threat information from its measurement of the environment, that is, a threat appears within the LOS line of sight from the pursuit spacecraft to the target; the current task of the pursuit spacecraft is then to execute a fly-around evasion maneuver against the threat. The pursuit spacecraft should fly onto the flight trajectory set by the LGVF method and then execute the fly-around action along that trajectory; finally, when the pursuit target, the pursuit spacecraft and the threat form an obtuse angle, that is, when the pursuit spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escape spacecraft.
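A minimal sketch of this first-stage/second-stage switching rule is given below; the LOS-cone half-angle used to decide that a threat "appears in the LOS line of sight" is an illustrative assumption, while the obtuse-angle exit condition follows the description above.

```python
# Hedged sketch of the two-level subtask selection; los_half_angle is an assumed
# detection threshold, not a value from the patent.
import numpy as np

def angle_at_p(p_pos, a_pos, b_pos):
    """Angle at vertex P between the rays P->A and P->B, in radians."""
    u, v = a_pos - p_pos, b_pos - p_pos
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def select_subtask(p_pos, e_pos, threat_pos, los_half_angle=np.deg2rad(20.0)):
    """Return 1 (move toward E) or 2 (fly around a threat along the LGVF trajectory)."""
    for t_pos in threat_pos:
        ang = angle_at_p(p_pos, e_pos, t_pos)   # angle target-pursuer-threat
        if ang < los_half_angle:
            return 2   # threat inside the LOS cone: second-stage task
        # once this angle becomes obtuse the threat is behind and stage 1 resumes
    return 1
```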
4. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein step 4 is specifically:
The autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature enables the pursuit spacecraft to learn the common characteristics of the problem through training, and the offline-trained model can be used directly in online test applications, which provides the capability of handling dynamic uncertain states and meets the current dynamic pursuit scene requiring real-time optimization; an entropy coefficient adapted to the different subtasks of the pursuit spacecraft improves the learning efficiency of the SAC algorithm and the tracking precision of the pursuit spacecraft;
For a general deep reinforcement learning algorithm, the learning objective is to learn a policy that maximizes the expected cumulative reward obtained through the interaction of the pursuit spacecraft with the environment, in the form of formula (10):

$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ \sum_{t} R(s_t, a_t) \right]$ (10)

where $R(s_t, a_t)$ represents the return value obtained by taking action $a_t$ in state $s_t$, and the goal of policy $\pi^{*}$ is to maximize the expected overall return value; the SAC algorithm belongs to the maximum entropy policy reinforcement learning algorithms, that is, in addition to this basic objective, the entropy of the actions output by the policy is required to be maximal, in the form of formula (11):

$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ \sum_{t} R(s_t, a_t) + \alpha H(\pi(\cdot|s_t)) \right]$ (11)

where $\alpha$ is the entropy coefficient and $H(\pi(\cdot|s_t))$ is the entropy of the policy; requiring maximum entropy mainly serves to randomize the policy, that is, the output probability of each action is balanced as much as possible, so that the pursuit spacecraft explores all possible optimal paths.
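As an illustration of the objective in formula (11), the following PyTorch-style sketch shows an entropy-regularised actor update and a temperature (entropy coefficient) adaptation step; the `policy.sample` and `q_net` interfaces, the target-entropy mechanism and all names are assumptions, not the patent's implementation.

```python
# Hedged sketch of a maximum-entropy (SAC-style) actor loss and an adaptive
# entropy coefficient update; interfaces and names are illustrative assumptions.
import torch

def actor_loss(policy, q_net, states, alpha):
    actions, log_probs = policy.sample(states)   # reparameterised sample and log pi(a|s)
    q_values = q_net(states, actions)
    # maximising E[Q + alpha * H(pi)] is equivalent to minimising E[alpha * log pi - Q]
    return (alpha * log_probs - q_values).mean()

def adapt_alpha(log_alpha, log_probs, target_entropy, optimizer):
    """One gradient step on the temperature so policy entropy tracks target_entropy."""
    loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return log_alpha.exp()   # updated entropy coefficient alpha
```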
5. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein the specific process in step 5 is as follows:
Aiming at the situation where a threat randomly appears within the LOS viewing angle in a complex scene, the pursuit spacecraft is required to make evasion maneuver decisions in real time: in this case, a trajectory controller designed according to the Lyapunov guidance vector field algorithm establishes a vector field model for the threat $T_i$ and designs an evasion trajectory for the pursuit spacecraft, guiding it to fly stably around the threat;
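For reference, a commonly used standoff-circling form of the Lyapunov guidance vector field is sketched below; this standard form is an assumption about the controller, since the patent does not reproduce the field equations here, and the loiter radius would be chosen per the stage-2 description in this claim.

```python
# Hedged sketch of a standard Lyapunov guidance vector field (LGVF) for circling
# a point at a fixed standoff radius; not the patent's exact field.
import numpy as np

def lgvf_desired_velocity(p_pos, threat_pos, loiter_radius, speed):
    """Desired velocity guiding the pursuer onto a circle of loiter_radius around the threat."""
    x, y = p_pos - threat_pos
    r = np.hypot(x, y) + 1e-9
    rd = loiter_radius
    k = speed / (r * (r**2 + rd**2))
    vx = -k * (x * (r**2 - rd**2) + 2.0 * y * r * rd)
    vy = -k * (y * (r**2 - rd**2) - 2.0 * x * r * rd)
    return np.array([vx, vy])   # heading of this vector gives the commanded course
```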
The return function is designed in the form of formula (12):

$r = r_a + r_b + r_c$ (12)
where r represents the total return value and $r_a$ represents the sparse return function, specifically the return given when the pursuit spacecraft successfully captures the target and the return given when the pursuit spacecraft is intercepted, in the form of formula (13):

$r_a = \begin{cases} r_{capture}, & \text{if the capture condition of formula (4) is met} \\ r_{intercept}, & \text{if the interception condition of formula (5) is met} \\ 0, & \text{otherwise} \end{cases}$ (13)

where $r_{capture}$ and $r_{intercept}$ denote the return given upon successful capture and the return given upon interception, respectively;
$r_b$ is the first of the designed guided reward components: throughout the whole process, the included angle between the velocity direction of the pursuit spacecraft and the target line of sight LOS to the escape spacecraft is calculated at each iteration step, scaled by the weight $\mu_1$ to control its magnitude, and added to the total return value, in the form of formula (14):

$r_b = -\mu_1 \alpha_{PE}$ (14)
r c the designed guiding type rewarding return component is two, the process of the pursuing spacecraft flying to the escape spacecraft is divided into two stages, and the form is shown as a formula (15):
Figure FDA0004093638100000053
the stage 1 refers to that the pursuing spacecraft does not sense the existence of threat, namely, when the pursuing spacecraft is in a first-stage task, the pursuing spacecraft continuously moves towards the escaping spacecraft under the driving of a designed autonomous motion planning model, and r is as follows c The guiding function is not exerted; stage 2 refers to when a threat occurs in a path where a vector of the pursuit spacecraft points to the escape spacecraft, that is, the pursuit spacecraft is in a second stage task, a vector field model is built for the threat T according to a lyapunov guidance vector field algorithm, and the radius of the vector field is designed to be d PT +w P, wherein wP The method refers to the pursuit of the fuselage width of the spacecraft, and threat level is expressed in an exponential function form; at this time r c Guiding the distance between the control of the pursuit spacecraft and the threat, otherwise, generating a negative return value through the weight mu 2 And controlling the report value dimension and adding the result to the total report value as punishment.
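Putting formulas (12) through (15) together, a hedged sketch of the composite return is shown below; the reward magnitudes, the weights μ1 and μ2, and the exact exponential form of the stage-2 penalty are illustrative assumptions, since formulas (13) and (15) are given only as images in the original.

```python
# Hedged sketch of the composite return r = r_a + r_b + r_c; all numeric values
# (reward magnitudes, weights, exponential scale) are illustrative assumptions.
import numpy as np

def total_reward(d_PE, d_PT, alpha_PE, R_P, L, stage,
                 mu1=0.01, mu2=0.1, r_capture=10.0, r_intercept=-10.0):
    d_PT = np.atleast_1d(d_PT)
    # r_a: sparse return for capture / interception, mirroring formulas (4) and (5)
    if d_PE <= R_P:
        r_a = r_capture
    elif np.any(d_PT <= L):
        r_a = r_intercept
    else:
        r_a = 0.0
    # r_b: guided return penalising misalignment with the LOS to the target (formula (14))
    r_b = -mu1 * alpha_PE
    # r_c: two-stage guided return; inactive in stage 1, exponential threat penalty in stage 2
    if stage == 1:
        r_c = 0.0
    else:
        r_c = -mu2 * float(np.exp(-np.min(d_PT) / np.max(np.atleast_1d(L))))
    return r_a + r_b + r_c
```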
6. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein in step 6 the combination optimization method is applied to the established spacecraft pursuit task scene and the autonomous motion planning model of the pursuit spacecraft is trained, specifically:
step 6.1: loading the spacecraft pursuit task into the deep reinforcement learning algorithm, specifically:
In the established spacecraft pursuit task scene, the process in which the pursuit spacecraft, in the current state $s_t$, takes action $a_t$ and transitions to the next state $s_{t+1}$ is regarded as the Markov process with action and reward introduced in reinforcement learning; it is described by the five-tuple $(S, A, \mathcal{T}, p, \gamma)$, where S is the state space of the process, i.e. State, A is the action set, i.e. Action, $\mathcal{T}$ is the set of time steps, p is the state transition probability function, and $\gamma$ is the return function at each state step;
At each decision time t, a state transition probability matrix $p_t$ is obtained within the finite action space, as shown in formula (16):

$p_t = [p(s'|s, a_1)\ \dots\ p(s'|s, a_N)]$ (16)

where s is the state of the pursuit spacecraft at the current moment, and executing an action from $\{a_1, \dots, a_N\}$ leads to a new state s'; $a_i$ is the i-th action in the action space, N is the total number of actions in the action space, and $p(s'|s, a_i)$ represents the probability that the pursuit spacecraft reaches the new state s' after executing action $a_i$; the return matrix $R_i(s, a)$ generated by the interaction of the pursuit spacecraft with the environment and the total return value function $R(s, a)$ formed from it are shown in formula (17) and formula (18):

$R_i(s, a) = \gamma(s_i, a_i)$ (17)

$R(s, a) = \sum_{t} \gamma(s_t, a_t)$ (18)

Combining the total return function with the maximum entropy strategy of claim 4 yields the SAC algorithm optimization objective shown in formula (11);
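A one-line illustration of formulas (17) and (18): summing the per-step return function γ over a collected trajectory gives the total return; the `gamma_fn` interface is an assumed stand-in for the return function named γ in the five-tuple.

```python
# Minimal illustration of accumulating the total return of formula (18);
# gamma_fn(s, a) stands for the per-step return function gamma and is an assumption.
def episode_return(trajectory, gamma_fn):
    """trajectory: list of (state, action) pairs visited by the pursuit spacecraft."""
    return sum(gamma_fn(s, a) for s, a in trajectory)
```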
step 6.2: training the autonomous motion planning model of the pursuit spacecraft, specifically:
By adjusting the parameters of the network, the autonomous motion planning model is perfected and the actions of the pursuit spacecraft in different states are guided; by adjusting the learning rate, the training time of the model is shortened and the training efficiency of the model is maximized; by adjusting the weight of each return term in the return function, the optimization process of the pursuit spacecraft in the huge action space is guided; by adjusting the entropy coefficient, the learning efficiency of the algorithm and the tracking precision of the pursuit spacecraft are improved; through these adjustments, the total return value of the algorithm steadily converges to a constant as the number of training iterations increases.
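An outline of what such an offline training loop might look like is sketched below; the env/agent interfaces and every hyper-parameter are illustrative placeholders rather than the patent's implementation, and the returned episode-return curve is what the tuning described above would monitor for convergence.

```python
# Hedged outline of the offline training loop of step 6.2; env and agent are assumed
# interfaces (reset/step and act/update with a replay buffer), not the patent's code.
def train(env, agent, episodes=2000, max_steps=500):
    returns = []
    for ep in range(episodes):
        state, total = env.reset(), 0.0
        for _ in range(max_steps):
            action = agent.act(state)                    # angular-velocity command omega
            next_state, reward, done = env.step(action)  # reward is the composite r of formula (12)
            agent.buffer.add(state, action, reward, next_state, done)
            agent.update()   # SAC gradient steps: actor, critics and entropy coefficient
            state, total = next_state, total + reward
            if done:
                break
        returns.append(total)   # convergence of this curve guides the parameter tuning above
    return returns
```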
7. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 6, wherein step 7 specifically comprises the following steps:
Firstly, the parameters of the SAC algorithm network model and the structure of the autonomous motion planning architecture are adjusted according to the feedback results, adapting them to the spacecraft pursuit task scene and perfecting the autonomous motion planning architecture of step 4;
Secondly, the LGVF model parameters of step 5 are adjusted according to the feedback results, so that the threat-avoidance trajectory designed for the pursuit spacecraft by the vector field is more stable and better suited to the spacecraft pursuit task scene;
Finally, the parameters of the autonomous motion planning model of step 6.2 are adjusted according to the feedback results, improving the model training efficiency and the tracking precision of the pursuit spacecraft.
The combination optimization method is thus perfected through these algorithm feedback optimization means.
CN202310159415.2A 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF Pending CN116107213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310159415.2A CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310159415.2A CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Publications (1)

Publication Number Publication Date
CN116107213A true CN116107213A (en) 2023-05-12

Family

ID=86258005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310159415.2A Pending CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Country Status (1)

Country Link
CN (1) CN116107213A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350326A (en) * 2023-11-29 2024-01-05 北京航空航天大学 Multi-machine trapping method and device for hierarchical collaborative learning, electronic equipment and medium
CN117350326B (en) * 2023-11-29 2024-04-09 北京航空航天大学 Multi-machine trapping method and device for hierarchical collaborative learning, electronic equipment and medium
CN117434968A (en) * 2023-12-19 2024-01-23 华中科技大学 Multi-unmanned aerial vehicle escape-tracking game method and system based on distributed A2C
CN117434968B (en) * 2023-12-19 2024-03-19 华中科技大学 Multi-unmanned aerial vehicle escape-tracking game method and system based on distributed A2C

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination