CN116107213A - Spacecraft pursuit task combination optimization control method based on SAC and LGVF - Google Patents

Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Info

Publication number
CN116107213A
Authority
CN
China
Prior art keywords
spacecraft
pursuit
threat
task
model
Prior art date
Legal status
Pending
Application number
CN202310159415.2A
Other languages
Chinese (zh)
Inventor
周林
程聪聪
冷俊芳
张梦
丁鑫龙
魏倩
彭青蓝
姚鸿泰
晏加元
邱倩
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202310159415.2A
Publication of CN116107213A

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B 13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems


Abstract

The invention provides a spacecraft pursuit task combination optimization control method based on SAC and LGVF. First, a hierarchical control method divides the tasks in the pursuit scene into different stages and establishes a hierarchical simplified model of the spacecraft pursuit task. Second, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, giving the pursuing spacecraft the ability to handle dynamic uncertain states. Finally, the Lyapunov guidance vector field (LGVF) method is introduced within the framework of the improved SAC algorithm to form a combined control method that compresses the solution space and thereby optimizes the solving process in a very large solution space. The method enables the pursuing spacecraft to complete the pursuit task autonomously in a scene where external information is only partially observable and unpredictable, providing real-time autonomous control capability and improving the task success rate.

Description

Spacecraft pursuit task combination optimization control method based on SAC and LGVF
Technical Field
The invention relates to the technical field of autonomous spacecraft control, in particular to a combined optimal control method for a spacecraft pursuit task based on SAC and LGVF.
Background
The spacecraft pursuit-evasion problem is a research hotspot in the current field of air and space combat. With the improvement of performance, a spacecraft can not only carry out battlefield reconnaissance but also perform pursuit tasks, completing the conversion from a reconnaissance platform to a combat platform.
Technical research on the spacecraft pursuit task has been carried out abroad since the early 1970s; its core purpose is to control the spacecraft so that it tracks the target while avoiding threats and guaranteeing its own safety. Conventional methods for solving the combinatorial optimization problem include exact, approximate, and heuristic algorithms, many of which have proven to be reliable and stable. However, traditional methods rarely exploit the common characteristics shared among problem instances to obtain a general solution; a new solver must be built for each different instance of a similar problem, so these methods cannot be applied to dynamic combinatorial optimization problems in which the scene changes from moment to moment.
Disclosure of Invention
The invention aims to provide a spacecraft pursuit task combination optimization control method based on SAC and LGVF, which solves two problems of the spacecraft pursuit task in a dynamic unknown environment: first, the unpredictability of external information, such as the target's escape mode, threat positions, and fire ranges; and second, the partial observability of external information, since only part of the environment state can be obtained through the spacecraft's sensors.
The invention adopts the technical scheme that:
a spacecraft pursuit task combination optimization control method based on SAC and LGVF specifically comprises the following steps:
step 1: the method comprises the following specific steps of establishing a spacecraft pursuit task scene model:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P capturing the escaping spacecraft E in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat, respectively, where T_i denotes the i-th threat;
the dynamic differential model of the pursuing spacecraft and the escaping spacecraft is built for the scene as shown in formula (2):

dx_i/dt = v_i·cos ψ_i,  dy_i/dt = v_i·sin ψ_i,  dv_i/dt = a_i,  dψ_i/dt = ω_i   (2)

where x_i, y_i are the current position of the spacecraft and dx_i/dt, dy_i/dt are their derivatives, i.e. the velocity components along the two coordinate directions; v_i denotes the speed of the spacecraft and a_i its acceleration, dv_i/dt being the derivative of the speed v_i and equal to the acceleration a_i; ψ_i denotes the heading angle of the spacecraft and ω_i its angular velocity, dψ_i/dt being the derivative of the heading angle ψ_i, i.e. the heading-angle rate of change, and equal to the angular velocity ω_i. The angular velocity of the pursuing spacecraft is determined by the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
The condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
step 2: according to the spacecraft pursuit task scene model established in the step 1, respectively designing a state space model, an action space model and a state transition model of the pursuit spacecraft and the escape spacecraft;
step 3: according to the established spacecraft pursuit task scene model in the step 1, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask;
Step 4: according to the layering simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning flexible actor critique algorithm is provided to establish an autonomous motion planning control framework, so that the capability of processing a dynamic uncertain state is provided for the pursuit spacecraft, and the current dynamic pursuit scene needing real-time optimization is met;
step 5: according to the layering simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of an improved SAC algorithm, a combination method is formed by introducing a Liapunov guide vector field method, and a SAC algorithm learning process is optimized to form a combination optimization method;
step 6: applying the combined optimization method in the step 5 to the established spacecraft pursuit task layering simplified model in the step 4, and training an autonomous motion planning model of the pursuit spacecraft;
step 7: loading the autonomous motion planning model of the pursuing spacecraft trained in step 6 into an online pursuit task simulation scene in which information is only partially observable and unpredictable for testing, and perfecting the combined optimization method through feedback of the test results of the pursuing spacecraft executing the pursuit task.
The step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
Setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
the method uses relative information relations to establish the State space model State, which compresses the measurement space, reduces the input processing load of the neural network to improve algorithm performance, and lets the algorithm focus on learning a solving scheme; the expression of the state space model is designed as shown in formula (7):

State = [d_PE, d_PTi, α_PE, α_PTi]   (7)

where d_PE is the distance between the pursuing spacecraft and the escaping spacecraft, d_PTi is the distance between the pursuing spacecraft and each threat, α_PE is the angle between the velocity direction of the pursuing spacecraft P and the target line of sight LOS to the escaping spacecraft E, and α_PTi is the angle between the velocity direction of the pursuing spacecraft P and the LOS to each threat T_i; the LOS refers to the vector direction pointing from the position of the pursuing spacecraft to the target;
step 2.2: design of action space models of a pursuing spacecraft and an escape spacecraft, and specifically:
the control inputs of the pursuing spacecraft are designed to be angular velocity and acceleration; the dynamics equation of the spacecraft sets the spacecraft to uniform motion, and the Action space Action is shown in formula (8):

Action = [ω]   (8)

The maximum angular velocity of the spacecraft is set to be less than 25.5 rad/sec, i.e. ω ∈ [-25.5, 25.5], with the counterclockwise direction in the top view taken as positive;
step 2.3: design of state transition models of a pursuing spacecraft and an escape spacecraft, and specifically:
the spacecraft motion state transition equation is shown in formula (9):

s_{i,t+1} = s_{i,t} + Δs_{i,t}   (9)

where i refers to the pursuing spacecraft or the escaping spacecraft; the spacecraft in the current state s_t takes action A_t, obtains the amount of state change through interaction with the scene, and adds it to the current state s_t to obtain the next state s_{t+1}.
Step 3, according to the spacecraft pursuit task scene model established in step 1, establishing a layering simplified model for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask, which specifically comprises the following steps:
firstly, a first-stage task refers to that when the pursuit spacecraft does not receive threat signals from measurement information of the environment, the pursuit spacecraft is required to continuously move towards the escape spacecraft under the driving of a designed autonomous motion planning model;
secondly, the second-stage task refers to the case in which the pursuing spacecraft receives threat information from the measurement information of the environment, i.e. a threat appears within the LOS from the pursuing spacecraft to the target; the current task of the pursuing spacecraft is then to execute a fly-around evasion maneuver around the threat. At this time the pursuing spacecraft should fly into the flight track set by the LGVF method and then execute the fly-around action along that track; finally, when the pursuit target, the pursuing spacecraft and the threat form an obtuse angle, i.e. the pursuing spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escaping spacecraft.
The step 4 specifically comprises the following steps:
the autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature allows the pursuing spacecraft to learn the common characteristics of the problem through training, while the offline-trained model can be used directly for online test application, providing the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization; adapting the entropy coefficient to the different subtasks of the pursuing spacecraft improves the learning efficiency of the SAC algorithm and the tracking accuracy of the pursuing spacecraft;
for a general deep reinforcement learning algorithm, the learning goal is to learn a policy that maximizes the expected cumulative reward obtained by the pursuing spacecraft interacting with the environment, in the form of formula (10):

π* = argmax_π E[ Σ_t R(s_t, a_t) ]   (10)

where R(s_t, a_t) denotes the return obtained by taking action a_t in state s_t, and the goal of the policy π* is to maximize the expected total return; the SAC algorithm belongs to the maximum-entropy reinforcement learning family, i.e. in addition to the basic objective above, the entropy of the actions output by the policy is also required to be maximal, in the form of formula (11):

π* = argmax_π E[ Σ_t R(s_t, a_t) + α·H(π(·|s_t)) ]   (11)

where α is the entropy coefficient and H(π(·|s_t)) is the policy entropy; requiring maximum entropy mainly serves to randomize the policy, i.e. to keep the probabilities of the output actions as balanced as possible, so that the pursuing spacecraft explores all possible optimal paths.
The specific process of the step 5 is as follows:
aiming at the situation that the threat randomly appears in the LOS view angle in a complex scene, the pursuit spacecraft is required to carry out evasion maneuver decision in real time: at this time, a trajectory controller designed according to the Lyapunov guidance vector field algorithm is used for threat T i Establishing a vector field model, and designing an evasion track of the pursuit spacecraft to guide the pursuit spacecraft to stably fly around the threat;
the design return function is in the form of formula (12):
r=r a +r b +r c (12)
wherein r represents the total return value, r a The sparse return function is represented, specifically, the return given by the pursuit spacecraft when successfully capturing the target and the return given by the pursuit spacecraft when intercepted are shown in the form of a formula (13):
Figure SMS_16
r b the method is one of designed guiding type rewarding return components, the included angle between the speed direction of each iterative step pursuing spacecraft and escaping spacecraft and the target sight LOS is calculated in the whole process, and the weight mu is used for calculating the target sight LOS 1 The dimension is controlled and then added to the total return value, and the form is shown as a formula (14):
r b =-μ 1 α PE (14)
r c The designed guiding type rewarding return component is two, the process of the pursuing spacecraft flying to the escape spacecraft is divided into two stages, and the form is shown as a formula (15):
Figure SMS_17
the stage 1 refers to that the pursuing spacecraft does not sense the existence of threat, namely, when the pursuing spacecraft is in a first-stage task, the pursuing spacecraft continuously moves towards the escaping spacecraft under the driving of a designed autonomous motion planning model, and r is as follows c The guiding function is not exerted; stage 2 refers to when a threat occurs in a path where a vector of the pursuit spacecraft points to the escape spacecraft, that is, the pursuit spacecraft is in a second stage task, a vector field model is built for the threat T according to a lyapunov guidance vector field algorithm, and the radius of the vector field is designed to be d PT +w P, wherein wP The method refers to the pursuit of the fuselage width of the spacecraft, and threat level is expressed in an exponential function form; at this time r c Guiding the distance between the control of the pursuit spacecraft and the threat, otherwise, generating a negative return value through the weight mu 2 And controlling the report value dimension and adding the result to the total report value as punishment.
Step 6, the combined optimization method is applied to an established spacecraft pursuit task scene, an autonomous motion planning model of the pursuit spacecraft is trained, and the specific process comprises the following steps:
Step 6.1: loading a spacecraft pursuit task into a deep reinforcement learning algorithm, and specifically:
in the established spacecraft pursuit task scene, the pursuing spacecraft in the current state s_t takes action a_t and transitions to the next state s_{t+1}; this is regarded as a Markov process with actions and rewards introduced, as in reinforcement learning. According to the Markov process model five-tuple ⟨S, A, T, p, γ⟩, S is the state space State of the process, A is the action set Action, T is the time sequence set, p is the state transition probability function, and γ is the return function for the state steps;
at each decision time t, a state transition probability matrix p_t is obtained over the finite action space, as shown in formula (16):

p_t = [ p(s'|s, a_1) ... p(s'|s, a_N) ]   (16)

where s is the state of the pursuing spacecraft at the current time, after executing an action from {a_1, ..., a_N} it enters a new state s'; a_i is the i-th action in the action space, N is the total number of actions in the action space, and p(s'|s, a_i) is the probability that the pursuing spacecraft reaches the new state s' after executing action a_i; the return matrix R_i(s, a) produced by the interaction of the pursuing spacecraft with the environment, and the total return function R(s, a) formed from it, are shown in formulas (17) and (18):

R_i(s, a) = γ(s_i, a_i)   (17)

R(s, a) = Σ_t γ(s_t, a_t)   (18)
the total return function is combined with the maximum-entropy strategy of step 4 to obtain the SAC algorithm optimization objective shown in formula (11);
Step 6.2: training an autonomous motion planning model of the pursuit spacecraft, and specifically:
by adjusting parameters of the network, an autonomous motion planning model is perfected, and actions of the pursuing spacecraft in different states are guided; the training time of the model is shortened and the training efficiency of the model is maximized by adjusting the learning rate; guiding the optimizing process of the pursuit spacecraft in a huge action space by adjusting the weight of each return in the return function; the learning efficiency of the algorithm is improved and tracking precision of the pursuit spacecraft is improved by adjusting the entropy coefficient; through the adjustment, the total return value of the algorithm tends to be converged to a constant steadily along with the increase of training times.
The step 7 specifically comprises the following steps:
firstly, adjusting parameters of a SAC algorithm network model and an autonomous motion planning framework structure according to feedback results, adapting to a spacecraft pursuit task scene, and perfecting the autonomous motion planning framework in the step 4;
secondly, adjusting the LGVF model parameters in the step 5 according to the feedback result, so that the pursuit spacecraft designed by the vector field has more stable avoidance track to the threat and is more suitable for the spacecraft pursuit task scene;
finally, parameters of the autonomous motion planning model in the step 6.2 are adjusted according to feedback results, and model training efficiency and tracking accuracy of the pursuit spacecraft are improved.
The combination optimization method is perfected by the feedback optimization means of the algorithm.
The beneficial effects of the invention are as follows:
through the above technical scheme, the invention provides a spacecraft pursuit task combination optimization control method based on SAC and LGVF, belonging to the technical field of autonomous spacecraft control. Aiming at the multi-interceptor pursuit game problem, the invention proposes a combined optimization method: according to the established spacecraft pursuit task scene, the pursuit task is divided into multiple levels of subtasks by a hierarchical control method; an improved SAC algorithm is proposed and a spacecraft autonomous motion planning control architecture is established; and the Lyapunov guidance vector field method is introduced to design the trajectory controller of the pursuing spacecraft. The invention combines the optimization capability of traditional model-based algorithms, the powerful perception capability of deep learning, and the heuristic learning capability of reinforcement learning into a new combined optimization algorithm, which represents new progress in the technical field of autonomous spacecraft control and can be applied to scenes such as a spacecraft autonomously avoiding multiple interceptors while executing a tracking task. The method places small demands on the spacecraft's computing resources, provides real-time autonomous control capability, and improves the task success rate.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a two-dimensional planar geometric model diagram of the present invention;
FIG. 3 is a model training convergence diagram of a combined optimization algorithm LGVF-SAC proposed by the present invention;
FIG. 4 is a model training convergence diagram of a comparative algorithm Original SAC of the combined optimization algorithm provided by the invention;
Detailed Description
As shown in fig. 1 to fig. 4, the spacecraft pursuit task combination optimization control method based on SAC and LGVF according to the present embodiment is specifically implemented by the following steps:
step 1: establishing a scene model of the pursuit game problem, with the following specific process:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P (Pursuit) capturing the escaping spacecraft E (Escape) in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; and f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat T (Threat), respectively.
The dynamic differential model of the pursuit spacecraft and the escape spacecraft is built for the scene as shown in a formula (2):
Figure SMS_20
wherein ,xi ,y i As the current position information of the spacecraft,
Figure SMS_21
respectively x i ,y i I.e. the component of the velocity in the direction of the two vectors; v i Representing the speed of a spacecraft, a i Representing acceleration of spacecraft, ++>
Figure SMS_22
Is the velocity v i Is equal to the acceleration a i ;ψ i Representing heading angle omega of spacecraft i Representing the angular velocity of a spacecraft,/->
Figure SMS_23
Is heading angle psi i The differential quantity of (a) i.e. the course angle change rate, is equal to the angular velocity omega i . The angular velocity value of the pursuing spacecraft depends on the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
the condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
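As an illustrative sketch of the above scene model, the planar kinematics of formula (2) together with the capture and interception checks of formulas (4) and (5) can be simulated in Python as follows; the time step, the class layout and the numerical values are assumptions used only for illustration.

import math
from dataclasses import dataclass

@dataclass
class Craft:
    x: float      # position x_i
    y: float      # position y_i
    v: float      # speed v_i
    psi: float    # heading angle psi_i
    R: float      # pursuit / fire action range R_i

def step(c: Craft, omega: float, a: float = 0.0, dt: float = 0.1) -> None:
    """Euler integration of formula (2): dx/dt = v*cos(psi), dy/dt = v*sin(psi), dv/dt = a, dpsi/dt = omega."""
    c.x += c.v * math.cos(c.psi) * dt
    c.y += c.v * math.sin(c.psi) * dt
    c.v += a * dt                 # a = 0 under the uniform-motion assumption of step 2.2
    c.psi += omega * dt

def dist(p: Craft, q: Craft) -> float:
    return math.hypot(p.x - q.x, p.y - q.y)

def task_status(P: Craft, E: Craft, threats: list) -> str:
    """Formula (4): success if d_PE <= R_P; formula (5): failure if d_PTi <= L = R_P + R_Ti."""
    if dist(P, E) <= P.R:
        return "success"
    if any(dist(P, T) <= P.R + T.R for T in threats):
        return "failure"
    return "running"

# assumed example values
P = Craft(0.0, 0.0, 1.0, 0.0, 0.5)
E = Craft(10.0, 5.0, 0.8, math.pi, 0.0)
threats = [Craft(5.0, 2.0, 0.0, 0.0, 1.0)]
step(P, omega=0.1)
print(task_status(P, E, threats))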
step 2: according to the scene model established in step 1, respectively designing the state space model of the pursuing spacecraft, the escaping spacecraft and the threats, the action space model of the pursuing spacecraft and the escaping spacecraft, and the state transition model of the pursuing spacecraft and the escaping spacecraft; step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
the method uses relative information relations to establish the State space model State, which compresses the measurement space, reduces the input processing load of the neural network to improve algorithm performance, and lets the algorithm focus on learning a solving scheme; the expression of the state space model is designed as shown in formula (7):

State = [d_PE, d_PTi, α_PE, α_PTi]   (7)

where d_PE is the distance between the pursuing spacecraft and the escaping spacecraft, d_PTi is the distance between the pursuing spacecraft and each threat, α_PE is the angle between the velocity direction of the pursuing spacecraft P and the target line of sight LOS to the escaping spacecraft E, and α_PTi is the angle between the velocity direction of the pursuing spacecraft P and the LOS to each threat T_i; the LOS is the vector direction pointing from the position of the pursuing spacecraft to the target.
Step 2.2: design of action space models of a pursuing spacecraft and an escape spacecraft, and specifically:
the control inputs of the pursuing spacecraft are designed to be angular velocity and acceleration; in order to focus the algorithm on learning a solving scheme, the invention assumes in the spacecraft dynamics equation that the spacecraft moves at uniform speed, and the Action space Action is shown in formula (8):

Action = [ω]   (8)

The action space is designed using the angular velocity of a typical spacecraft: the maximum angular velocity is set to be less than 25.5 rad/sec, i.e. ω ∈ [-25.5, 25.5], with the counterclockwise direction in the top view taken as positive.
Step 2.3: design of state transition models of a pursuing spacecraft and an escape spacecraft, and specifically:
the spacecraft motion state transition equation is shown in formula (9):

s_{i,t+1} = s_{i,t} + Δs_{i,t}   (9)

where i refers to the pursuing spacecraft or the escaping spacecraft. The spacecraft in the current state s_t takes action A_t, obtains the amount of state change through interaction with the scene, and adds it to the current state s_t to obtain the next state s_{t+1}.
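A brief Python sketch of how the relative state vector of formula (7) and the additive state transition of formula (9) could be computed is given below; the angle-wrapping convention and the helper names are illustrative assumptions rather than part of the method description.

import math

def los_angle(x, y, psi, tx, ty):
    """Angle between the craft's velocity direction (heading psi) and the LOS vector to the target."""
    los = math.atan2(ty - y, tx - x)
    diff = math.atan2(math.sin(los - psi), math.cos(los - psi))   # wrap to [-pi, pi]
    return abs(diff)

def build_state(P, E, threats):
    """Formula (7): State = [d_PE, d_PTi..., alpha_PE, alpha_PTi...] from relative quantities only."""
    px, py, ppsi = P
    d_PE = math.hypot(E[0] - px, E[1] - py)
    d_PT = [math.hypot(t[0] - px, t[1] - py) for t in threats]
    a_PE = los_angle(px, py, ppsi, E[0], E[1])
    a_PT = [los_angle(px, py, ppsi, t[0], t[1]) for t in threats]
    return [d_PE, *d_PT, a_PE, *a_PT]

def transition(s_t, delta_s):
    """Formula (9): the next state is the current state plus the change observed from the scene."""
    return [s + d for s, d in zip(s_t, delta_s)]

# assumed example: pursuer at the origin heading east, escaper and one threat ahead
state = build_state((0.0, 0.0, 0.0), (10.0, 5.0), [(5.0, 2.0)])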
Step 3, according to the established spacecraft pursuit task scene, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-layer subtask, and the method specifically comprises the following steps:
firstly, the first-stage task refers to that when the pursuing spacecraft does not receive threat signals from measurement information of the environment, the pursuing spacecraft is required to continuously move towards the escaping spacecraft under the driving of a designed autonomous motion planning model.
Secondly, the second-stage task refers to the case in which the pursuing spacecraft receives threat information from the measurement information of the environment, i.e. a threat appears within the LOS from the pursuing spacecraft to the target; the current task of the pursuing spacecraft is then to execute a fly-around evasion maneuver around the threat. At this time the pursuing spacecraft should fly into the flight track set by the LGVF method and then execute the fly-around action along that track; finally, when the pursuit target, the pursuing spacecraft and the threat form an obtuse angle, i.e. the pursuing spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escaping spacecraft.
According to the measurement information of the pursuit spacecraft on the current environment, a layering simplified model is built for the pursuit spacecraft task through a layering control method, so that the pursuit spacecraft task is simplified into the two-stage task.
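The two-stage switching logic of the hierarchical simplified model can be summarized by the sketch below; the geometric test for a threat appearing in the LOS (a circle-segment intersection) and the obtuse-angle cut-out test are assumed formulations consistent with the description above.

import math

def threat_in_los(P, E, T, R_T):
    """Does the threat circle of radius R_T intersect the LOS segment from the pursuer P to the target E?"""
    ex, ey = E[0] - P[0], E[1] - P[1]
    tx, ty = T[0] - P[0], T[1] - P[1]
    seg2 = ex * ex + ey * ey
    u = 0.0 if seg2 == 0 else max(0.0, min(1.0, (tx * ex + ty * ey) / seg2))
    return math.hypot(tx - u * ex, ty - u * ey) <= R_T

def angle_EPT_is_obtuse(P, E, T):
    """Cut-out condition: the target-pursuer-threat angle is obtuse, i.e. the threat has been flown around."""
    v1 = (E[0] - P[0], E[1] - P[1])
    v2 = (T[0] - P[0], T[1] - P[1])
    return v1[0] * v2[0] + v1[1] * v2[1] < 0.0

def current_stage(P, E, threats, R_T):
    """Stage 1: fly toward the escaping spacecraft; stage 2: circumnavigate a threat detected on the LOS."""
    for T in threats:
        if threat_in_los(P, E, T, R_T) and not angle_EPT_is_obtuse(P, E, T):
            return 2
    return 1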
Step 4: according to the hierarchical simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, providing the pursuing spacecraft with the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization, specifically:
the autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature allows the pursuing spacecraft to learn the common characteristics of the problem through training, while the offline-trained model can be used directly for online test application, providing the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization. Adapting the entropy coefficient to the different subtasks of the pursuing spacecraft improves the learning efficiency of the SAC algorithm and the tracking accuracy of the pursuing spacecraft.
For a general deep reinforcement learning algorithm, the learning goal is to learn a policy model that maximizes the expected cumulative reward obtained by the pursuing spacecraft interacting with the environment, in the form of formula (10):

π* = argmax_π E[ Σ_t R(s_t, a_t) ]   (10)

where R(s_t, a_t) denotes the return obtained by taking action a_t in state s_t, and the goal of the policy π* is to maximize the expected total return. The SAC algorithm belongs to the maximum-entropy reinforcement learning family, i.e. in addition to the basic objective above, the entropy of the actions output by the policy is also required to be maximal, in the form of formula (11):

π* = argmax_π E[ Σ_t R(s_t, a_t) + α·H(π(·|s_t)) ]   (11)

where α is the entropy coefficient and H(π(·|s_t)) is the policy entropy. Requiring maximum entropy mainly serves to randomize the policy, i.e. to keep the probabilities of the output actions as balanced as possible, which means the pursuing spacecraft needs to explore all possible optimal paths. First, the pursuing spacecraft learns not just one but as many ways of completing the task as possible, so the learned policy can adapt to more complex specific tasks; second, the method is more robust, because the pursuing spacecraft explores various optimal paths from different actions and can therefore adjust more easily when facing interference.
Utilizing the characteristics of the entropy term, the method adaptively adjusts the entropy coefficient according to the task stage the pursuing spacecraft is in. First, when the pursuing spacecraft receives no threat signal from the measurement information of the environment, it should keep flying toward the target, and the maximum-entropy strategy should encourage it to actively explore the optimal trajectory; at this stage the automatically tuned maximum-entropy coefficient of the SAC algorithm is retained. Second, when the pursuing spacecraft receives threat information from the measurement information of the environment, it should fly around along the flight track preset by the trajectory controller; at this stage the entropy coefficient is attenuated by multiplying it by a constant, which limits the exploratory behavior of the pursuing spacecraft so that it can fly stably along the preset track.
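The maximum-entropy objective of formula (11) and the stage-dependent handling of the entropy coefficient described above can be sketched as follows; the attenuation constant, the learning rate and the target entropy are assumed values following common SAC practice rather than figures given in this description.

import torch

log_alpha = torch.zeros(1, requires_grad=True)      # automatically tuned entropy coefficient of SAC
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
TARGET_ENTROPY = -1.0                                # one continuous action: the angular velocity omega
STAGE2_DECAY = 0.1                                   # assumed attenuation constant for the second-stage task

def effective_alpha(stage: int) -> torch.Tensor:
    """Stage 1 keeps the auto-tuned alpha; stage 2 attenuates it so the craft follows the LGVF track."""
    alpha = log_alpha.exp()
    return alpha * STAGE2_DECAY if stage == 2 else alpha

def alpha_loss(log_prob: torch.Tensor) -> torch.Tensor:
    """Temperature loss that drives the policy entropy toward TARGET_ENTROPY (standard SAC update)."""
    return -(log_alpha * (log_prob + TARGET_ENTROPY).detach()).mean()

def policy_loss(q_value: torch.Tensor, log_prob: torch.Tensor, stage: int) -> torch.Tensor:
    """Formula (11): maximize E[Q(s,a) + alpha*H(pi(.|s))], i.e. minimize alpha*log_prob - Q."""
    return (effective_alpha(stage).detach() * log_prob - q_value).mean()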
Step 5: according to the hierarchical simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of the improved SAC algorithm, the Lyapunov guidance vector field method is introduced to form a combined method and optimize the SAC algorithm learning process, forming the combined optimization method. The specific process is as follows:
aiming at the situation in a complex scene where a threat appears randomly within the LOS view angle, the pursuing spacecraft must make evasion maneuver decisions in real time. In this case, a trajectory controller designed according to the Lyapunov guidance vector field algorithm establishes a vector field model for the threat T_i and designs an evasion trajectory that guides the pursuing spacecraft to fly stably around the threat.
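The description does not reproduce the explicit LGVF equations, so the sketch below uses a commonly published form of the Lyapunov guidance vector field for a standoff circle around the threat; it should be read as an assumed reference implementation of the trajectory controller, with the standoff radius set, as stated later in this step, to the threat distance plus the pursuer's fuselage width.

import math

def lgvf_velocity(px, py, tx, ty, r_d, v_d):
    """Desired velocity of a Lyapunov guidance vector field converging to a circle of radius r_d
    around the threat at (tx, ty); this is the commonly used standoff-tracking form (assumed here)."""
    x, y = px - tx, py - ty
    r = math.hypot(x, y)
    if r < 1e-6:
        return 0.0, 0.0
    c = -v_d / (r * (r * r + r_d * r_d))
    ux = c * (x * (r * r - r_d * r_d) + y * (2.0 * r * r_d))
    uy = c * (y * (r * r - r_d * r_d) - x * (2.0 * r * r_d))
    return ux, uy

def lgvf_turn_rate(px, py, psi, tx, ty, r_d, v_d, k=1.0, omega_max=25.5):
    """Proportional turn-rate command (assumed law) steering the heading toward the LGVF direction,
    clipped to the angular-velocity limit of formula (8)."""
    ux, uy = lgvf_velocity(px, py, tx, ty, r_d, v_d)
    err = math.atan2(uy, ux) - psi
    err = math.atan2(math.sin(err), math.cos(err))   # wrap to [-pi, pi]
    return max(-omega_max, min(omega_max, k * err))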
The return function is designed in the form of formula (12):

r = r_a + r_b + r_c   (12)

where r denotes the total return value and r_a denotes the sparse return component, namely the return given when the pursuing spacecraft successfully captures the target and the (negative) return given when the pursuing spacecraft is intercepted, in the form of formula (13):

r_a = { positive capture return, if d_PE ≤ R_P;  negative interception return, if d_PTi ≤ L;  0, otherwise }   (13)

r_b is the first designed guiding reward component: throughout the process, at every iteration step the angle between the velocity direction of the pursuing spacecraft and the target line of sight LOS to the escaping spacecraft is computed, its magnitude is controlled by the weight μ_1, and it is added to the total return, in the form of formula (14):

r_b = -μ_1·α_PE   (14)

r_c is the second designed guiding reward component; the process of the pursuing spacecraft flying toward the escaping spacecraft is divided into two stages, in the form of formula (15):

r_c = { 0, in stage 1;  -μ_2·(exponential threat-distance penalty), in stage 2 }   (15)

Stage 1 means the pursuing spacecraft does not perceive the existence of a threat, i.e. it is in the first-stage task and keeps moving toward the escaping spacecraft under the designed autonomous motion planning model, so r_c exerts no guiding effect; stage 2 means a threat appears on the path of the vector pointing from the pursuing spacecraft to the escaping spacecraft, i.e. the pursuing spacecraft is in the second-stage task: a vector field model is built for the threat T according to the Lyapunov guidance vector field algorithm, the radius of the vector field is designed as d_PT + w_P, where w_P is the fuselage width of the pursuing spacecraft, and the threat level is expressed in the form of an exponential function. In this case r_c guides the pursuing spacecraft to control its distance to the threat, and otherwise the generated negative return value, with its magnitude controlled by the weight μ_2, is added to the total return as a punishment.
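A compact sketch of the composite return of formulas (12) to (15) is given below; the sparse reward magnitudes, the exact exponential shape of the stage-2 penalty and the weights μ_1 and μ_2 are placeholders, since the description defines their roles but not their numerical values.

import math

MU1, MU2 = 0.01, 0.1                      # placeholder shaping weights
R_CAPTURE, R_INTERCEPT = 10.0, -10.0      # placeholder sparse returns

def reward(d_PE, R_P, d_PT, L, alpha_PE, stage, r_field):
    """Total return r = r_a + r_b + r_c as in formula (12)."""
    # r_a, formula (13): sparse return on capture or interception
    if d_PE <= R_P:
        r_a = R_CAPTURE
    elif min(d_PT) <= L:
        r_a = R_INTERCEPT
    else:
        r_a = 0.0
    # r_b, formula (14): penalize the angle between the pursuer's velocity and the LOS to the target
    r_b = -MU1 * alpha_PE
    # r_c, formula (15): zero in stage 1; an exponential threat-distance penalty in stage 2 (assumed shape)
    r_c = 0.0 if stage == 1 else -MU2 * math.exp(r_field - min(d_PT))
    return r_a + r_b + r_c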
Step 6, applying the combined optimization method to an established spacecraft pursuit task scene to train an autonomous motion planning model of the pursuit spacecraft, wherein the specific process comprises the following steps:
Step 6.1: loading a spacecraft pursuit task into a deep reinforcement learning algorithm, and specifically:
in the established spacecraft pursuit task scene, the pursuing spacecraft in the current state s_t takes action a_t and transitions to the next state s_{t+1}; this is regarded as a Markov process with actions and rewards introduced, as in reinforcement learning. According to the Markov process model five-tuple ⟨S, A, T, p, γ⟩, S is the state space State of the process, A is the action set Action, T is the time sequence set, p is the state transition probability function, and γ is the return function for the state steps.
At each decision time t, a state transition probability matrix p_t is obtained over the finite action space, as shown in formula (16):

p_t = [ p(s'|s, a_1) ... p(s'|s, a_N) ]   (16)

where s is the state of the pursuing spacecraft at the current time, after executing an action from {a_1, ..., a_N} it enters a new state s'; a_i is the i-th action in the action space, N is the total number of actions in the action space, and p(s'|s, a_i) is the probability that the pursuing spacecraft reaches the new state s' after executing action a_i. The return matrix R_i(s, a) produced by the interaction of the pursuing spacecraft with the environment, and the total return function R(s, a) formed from it, are shown in formulas (17) and (18):

R_i(s, a) = γ(s_i, a_i)   (17)

R(s, a) = Σ_t γ(s_t, a_t)   (18)
the total return function is combined with the maximum-entropy strategy of step 4 to obtain the SAC algorithm optimization objective as shown in formula (11) above.
Step 6.2: training an autonomous motion planning model of the pursuit spacecraft, and specifically:
by adjusting parameters of the network, an autonomous motion planning model is perfected, and actions of the pursuing spacecraft in different states are guided; the training time of the model is shortened and the training efficiency of the model is maximized by adjusting the learning rate; guiding the optimizing process of the pursuit spacecraft in a huge action space by adjusting the weight of each return in the return function; by adjusting the entropy coefficient, the algorithm learning efficiency and tracking accuracy of the pursuit spacecraft are improved. Through the adjustment, the total return value of the algorithm tends to be converged to a constant steadily along with the increase of training times.
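The training procedure of step 6.2 could be organized as in the abbreviated loop below; the environment and agent interfaces and the hyperparameter values are assumptions, shown only to indicate where the tuning knobs mentioned above (learning rate, return weights, entropy coefficient) enter the process.

import random
from collections import deque

def train(env, agent, total_steps=90_000, batch_size=256, warmup=1_000):
    """Off-policy SAC-style training: collect transitions into a replay buffer, then update the agent.

    Assumed interfaces: env.reset() -> state, env.step(action) -> (next_state, reward, done);
    agent.act(state) -> action, agent.update(batch) performs one gradient step on actor, critics and alpha."""
    buffer = deque(maxlen=100_000)
    state = env.reset()
    for _ in range(total_steps):
        action = agent.act(state)
        next_state, r, done = env.step(action)            # r = r_a + r_b + r_c from formula (12)
        buffer.append((state, action, r, next_state, done))
        state = env.reset() if done else next_state
        if len(buffer) >= warmup:
            agent.update(random.sample(buffer, batch_size))
    return agent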
And 7, loading the autonomous motion planning model of the pursuit spacecraft trained in the step 6 into an online pursuit task scene with observable and unpredictable information parts, and perfecting the combined optimization method through test effect feedback of the pursuit spacecraft execution pursuit task. Firstly, adjusting parameters of a SAC algorithm network model and an autonomous motion planning framework structure according to feedback results, adapting to a spacecraft pursuit task scene, and perfecting the autonomous motion planning framework; secondly, adjusting the parameters of the LGVF model according to the feedback result, so that the pursuit spacecraft designed by the vector field has more stable avoidance track to the threat and is more suitable for the spacecraft pursuit task scene; finally, algorithm parameters are adjusted according to the feedback result and the model training scheme in the step 6.2, and the model training process is executed again. The combination optimization method is perfected by the feedback optimization means of the algorithm.
With the development of artificial intelligence technology in recent years, deep learning has broken through the barriers of traditional methods in many fields and made remarkable progress. As an important branch of deep learning, deep reinforcement learning is mainly used for sequential decision making, i.e. selecting an action according to the current environment state and continuously adjusting the model according to the feedback of that action, so as to reach the set goal. The combinatorial optimization problem makes optimal selections in a discrete decision space, which resembles the natural 'action selection' characteristic of reinforcement learning, and the 'offline training, online decision' characteristic of deep reinforcement learning makes online real-time solution of the combinatorial optimization problem possible; the combinatorial optimization problem is therefore well suited to being solved with deep reinforcement learning methods.
Considering complexities such as the unpredictability and partial observability of external information in real application scenes, such a scene is difficult to solve by directly establishing a mathematical model, while the optimality of solutions output directly by a learned model-free method is difficult to guarantee. To ensure the flexibility and reliability of the solving method, a better scheme is to solve the problem with a combined optimization method constructed through combined learning. First, a hierarchical control method divides the tasks in the pursuit scene into different stages and establishes a hierarchical simplified model; second, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, giving the pursuing spacecraft the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization; finally, the Lyapunov guidance vector field (LGVF) method is introduced under the framework of the improved SAC algorithm to form a combined method, compressing the solution space to optimize the solving process in a very large solution space.
Through repeated tests and verification in the simulation environment, the LGVF-SAC combined optimization algorithm proposed by the invention reaches a converged model state after 90000 steps of training in each initialized scene, whereas the original SAC algorithm is difficult to converge in scenes with the same layout and its convergence is degraded by premature overfitting during training. In addition, as shown in Table 1, when the two trained models are tested in the same simulation environment, the task execution success rate of the spacecraft controlled by the original SAC algorithm is low, while the task execution success rate of the spacecraft controlled by the LGVF-SAC combined optimization algorithm is significantly improved.
TABLE 1

Algorithm      | Success rate | Convergence speed
LGVF-SAC       | 87%          | 90000 steps
Original SAC   | 26%          | Difficult to converge; overfits
The invention reduces scene complexity by hierarchically simplifying the spacecraft pursuit-evasion scene, and improves algorithm learning capability and task execution capability by combining the improved SAC algorithm with LGVF, which not only gives the spacecraft autonomy and dynamic pursuit capability but also guarantees the control accuracy and task success rate of the spacecraft when executing tasks autonomously.

Claims (7)

1. The spacecraft pursuit task combination optimal control method based on SAC and LGVF is characterized by comprising the following steps:
Step 1: the method comprises the following specific steps of establishing a spacecraft pursuit task scene model:
the spacecraft pursuit task in a dynamic unknown environment is described as a process in which the pursuing spacecraft must fly around the threats present in the scene while chasing the dynamically escaping spacecraft, and an optimization function model is established for this problem as shown in formula (1):

min t_c = G[f(P), f(E), f(T_i)]   (1)

The objective function t_c refers to the goal of the pursuing spacecraft P capturing the escaping spacecraft E in the shortest time; G[f(P), f(E), f(T_i)] denotes the fusion of the overall scene measurement information; f(P), f(E), f(T_i) denote the state information of the pursuing spacecraft, the escaping spacecraft and each threat, respectively, where T_i denotes the i-th threat;
the dynamic differential model of the pursuing spacecraft and the escaping spacecraft is built for the scene as shown in formula (2):

dx_i/dt = v_i·cos ψ_i,  dy_i/dt = v_i·sin ψ_i,  dv_i/dt = a_i,  dψ_i/dt = ω_i   (2)

where x_i, y_i are the current position of the spacecraft and dx_i/dt, dy_i/dt are their derivatives, i.e. the velocity components along the two coordinate directions; v_i denotes the speed of the spacecraft and a_i its acceleration, dv_i/dt being the derivative of the speed v_i and equal to the acceleration a_i; ψ_i denotes the heading angle of the spacecraft and ω_i its angular velocity, dψ_i/dt being the derivative of the heading angle ψ_i, i.e. the heading-angle rate of change, and equal to the angular velocity ω_i. The angular velocity of the pursuing spacecraft is determined by the model output of the reinforcement learning algorithm;
the state and initial state of the spacecraft and threats are shown in formula (3):

x_i(t_0) = x_i0 + Δx_i,  y_i(t_0) = y_i0 + Δy_i,  v_i(t_0) = v_i0,  ψ_i(t_0) = ψ_i0 + Δω_i,  R_i(t_0) = R_i0   (3)

where x_i0, y_i0 denote the initial position of a spacecraft or threat and Δx_i, Δy_i the change in its position, so that x_i(t_0), y_i(t_0) denote the new position reached from the initial position x_i0, y_i0 after a displacement change of Δx_i, Δy_i; v_i0 denotes the initial speed of a spacecraft or threat and v_i(t_0) its speed at any time t_0; ψ_i0 denotes the initial heading angle of a spacecraft or threat, Δω_i the change in its heading angle, and ψ_i(t_0) the new heading angle obtained from the initial angle ψ_i0 after an angle change of Δω_i; R_i0 denotes the initialized fire action range of a spacecraft or threat, and R_i(t_0) its fire action range at any time t_0, which remains equal to the initial fire action range R_i0; the individual threats are randomly distributed in the scene to simulate scene complexity;
the condition for success of the scene task is set as the distance between the pursuing spacecraft P and the escaping spacecraft E being smaller than the pursuit range of the pursuing spacecraft, as shown in formula (4):

d_PE ≤ R_P   (4)

where d_PE is the distance between the pursuing spacecraft P and the escaping spacecraft E, and R_P is the pursuit action range of the pursuing spacecraft P;
the condition for failure of the scene task is set as the weapon ranges of the pursuing spacecraft P and a threat T overlapping, i.e. the distance between the pursuing spacecraft and the threat being smaller than the safe distance, as shown in formula (5):

d_PTi ≤ L   (5)

where d_PTi is the distance between the pursuing spacecraft and each threat; L is the safe distance between the pursuing spacecraft and the threat, defined as the sum of the action ranges of the pursuing spacecraft and the threat, i.e. L = R_P + R_Ti, where, according to formula (3), R_Ti denotes the threat coverage of the i-th threat T_i;
step 2: according to the spacecraft pursuit task scene model established in the step 1, respectively designing a state space model, an action space model and a state transition model of the pursuit spacecraft and the escape spacecraft;
step 3: according to the established spacecraft pursuit task scene model in the step 1, a layering simplified model is established for the spacecraft pursuit task through a layering control method, so that the spacecraft pursuit task is simplified into a multi-level subtask;
step 4: according to the hierarchical simplified model of the established spacecraft pursuit task, an improved deep reinforcement learning Soft Actor-Critic (SAC) algorithm is proposed to establish an autonomous motion planning control architecture, providing the pursuing spacecraft with the ability to handle dynamic uncertain states and meeting the requirements of the current dynamic pursuit scene that needs real-time optimization;
step 5: according to the hierarchical simplified model of the established spacecraft pursuit task, under the autonomous motion planning control architecture of the improved SAC algorithm, the Lyapunov guidance vector field method is introduced to form a combined method, and the SAC algorithm learning process is optimized to form a combined optimization method;
step 6: applying the combined optimization method in the step 5 to the established spacecraft pursuit task layering simplified model in the step 4, and training an autonomous motion planning model of the pursuit spacecraft;
step 7: loading the autonomous motion planning model of the pursuing spacecraft trained in step 6 into an online pursuit task simulation scene in which information is only partially observable and unpredictable for testing, and perfecting the combined optimization method through feedback of the test results of the pursuing spacecraft executing the pursuit task.
2. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, characterized in that step 2 specifically comprises the following steps:
step 2.1: state space model design of a pursuit spacecraft, an escape spacecraft and a threat, and the design is specific:
setting that the pursuing spacecraft, the escaping spacecraft and the threats carry on-board GPS equipment and gyroscopes with which they acquire their own position and velocity information, and that the on-board fire-control radar of the pursuing spacecraft can acquire the position and velocity information of the target, as shown in formula (6):

f(i) = [x_i, y_i, v_i, ψ_i],  i = P, E, T   (6)
The method uses the relative information relation to establish a State space model State, so that the measurement space size is compressed, the input processing pressure of the neural network is reduced to improve the algorithm performance, the algorithm is focused on learning a solving scheme, and the expression design of the State space model is shown as a formula (7):
$State = [d_{PE},\ d_{PT_i},\ \alpha_{PE},\ \alpha_{PT_i}]$ (7)

where $d_{PE}$ refers to the distance between the pursuit spacecraft and the escape spacecraft, $d_{PT_i}$ refers to the distance between the pursuit spacecraft and each threat, $\alpha_{PE}$ is the included angle between the velocity direction of the pursuit spacecraft P and the target line of sight LOS to the escape spacecraft E, and $\alpha_{PT_i}$ is the included angle between the velocity direction of the pursuit spacecraft P and the target line of sight LOS to each threat $T_i$; the LOS refers to the vector direction pointing from the position of the pursuit spacecraft to the target;
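To make the relative-information state of formula (7) concrete, the sketch below assembles the state vector from the absolute states of formula (6); the helper names, the angle-wrapping convention and the stacking order are illustrative assumptions.

```python
# Hedged sketch of building the relative-information state of formula (7).
import numpy as np

def los_angle(own_pos, own_heading, target_pos):
    """Angle between own velocity direction and the line of sight to the target."""
    los = target_pos - own_pos
    los_dir = np.arctan2(los[1], los[0])
    # wrap the difference into [-pi, pi] and take its magnitude
    return np.abs((own_heading - los_dir + np.pi) % (2 * np.pi) - np.pi)

def build_state(p_pos, p_heading, e_pos, threat_pos):
    d_PE = np.linalg.norm(e_pos - p_pos)
    d_PT = np.linalg.norm(threat_pos - p_pos, axis=1)
    alpha_PE = los_angle(p_pos, p_heading, e_pos)
    alpha_PT = np.array([los_angle(p_pos, p_heading, t) for t in threat_pos])
    return np.concatenate(([d_PE], d_PT, [alpha_PE], alpha_PT))
```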
step 2.2: designing the action space models of the pursuit spacecraft and the escape spacecraft, specifically:
The control inputs of the pursuit spacecraft are designed as the angular velocity and the acceleration; according to the dynamic equation of the spacecraft, the spacecraft is set to move at a constant speed, so the action space Action is as shown in formula (8):

$Action = [\omega]$ (8)

The maximum angular velocity of the spacecraft is set to be less than 25.5 rad/sec, namely $\omega \in [-25.5, 25.5]$, with the counterclockwise direction in the top view taken as the positive direction;
step 2.3: designing the state transition models of the pursuit spacecraft and the escape spacecraft, specifically:
The spacecraft motion state transition equation is shown in formula (9):

$s_{t+1}^i = s_t^i + \Delta s_t^i$ (9)

where i refers to the pursuit spacecraft and the escape spacecraft; the spacecraft, in the current state $s_t$, takes action $A_t$, obtains the state transition increment $\Delta s_t$ through interaction with the scene, and adds this increment to the current state $s_t$ to obtain the next state $s_{t+1}$.
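For illustration, a constant-speed kinematic update consistent with the Action = [ω] design of formula (8) could look as follows; the discretisation and the time step Δt are assumptions, since formula (9) appears only as an image in the original filing.

```python
# Illustrative constant-speed kinematic step; dt and the Euler discretisation
# are assumptions, not the patent's exact state-transition equation.
import numpy as np

def step_state(x, y, theta, v, omega, dt):
    """One state transition s_{t+1} = s_t + Delta s_t for a constant-speed vehicle."""
    x_next = x + v * np.cos(theta) * dt
    y_next = y + v * np.sin(theta) * dt
    theta_next = theta + omega * dt
    return x_next, y_next, theta_next
```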
3. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, characterized in that in step 3, according to the spacecraft pursuit task scene model established in step 1, a hierarchical simplified model is built for the spacecraft pursuit task through a hierarchical control method, so that the spacecraft pursuit task is simplified into multi-level subtasks, specifically as follows:
Firstly, the first-stage task refers to the case where the pursuit spacecraft receives no threat signal from its measurement of the environment; the pursuit spacecraft is then required to move continuously toward the escape spacecraft under the drive of the designed autonomous motion planning model;
Secondly, the second-stage task refers to the case where the pursuit spacecraft receives threat information from its measurement of the environment, that is, a threat appears within the LOS line of sight from the pursuit spacecraft to the target; the current task of the pursuit spacecraft is then to execute a fly-around evasion maneuver against the threat. The pursuit spacecraft should fly onto the flight trajectory set by the LGVF method and then execute the fly-around action along that trajectory; finally, when the pursuit target, the pursuit spacecraft and the threat form an obtuse angle, that is, when the pursuit spacecraft has successfully flown around the threat, it exits the current second-stage task, switches back to the first-stage task, and continues to move toward the escape spacecraft.
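A minimal sketch of this first-stage/second-stage switching rule is given below; the LOS-cone half-angle used to decide that a threat "appears in the LOS line of sight" is an illustrative assumption, while the obtuse-angle exit condition follows the description above.

```python
# Hedged sketch of the two-level subtask selection; los_half_angle is an assumed
# detection threshold, not a value from the patent.
import numpy as np

def angle_at_p(p_pos, a_pos, b_pos):
    """Angle at vertex P between the rays P->A and P->B, in radians."""
    u, v = a_pos - p_pos, b_pos - p_pos
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(c, -1.0, 1.0))

def select_subtask(p_pos, e_pos, threat_pos, los_half_angle=np.deg2rad(20.0)):
    """Return 1 (move toward E) or 2 (fly around a threat along the LGVF trajectory)."""
    for t_pos in threat_pos:
        ang = angle_at_p(p_pos, e_pos, t_pos)   # angle target-pursuer-threat
        if ang < los_half_angle:
            return 2   # threat inside the LOS cone: second-stage task
        # once this angle becomes obtuse the threat is behind and stage 1 resumes
    return 1
```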
4. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein step 4 is specifically:
The autonomous motion planning control architecture is established through the deep reinforcement learning SAC algorithm; its end-to-end nature enables the pursuit spacecraft to learn the common characteristics of the problem through training, and the offline-trained model can be used directly in online test applications, which provides the capability of handling dynamic uncertain states and meets the current dynamic pursuit scene requiring real-time optimization; an entropy coefficient adapted to the different subtasks of the pursuit spacecraft improves the learning efficiency of the SAC algorithm and the tracking precision of the pursuit spacecraft;
For a general deep reinforcement learning algorithm, the learning objective is to learn a policy that maximizes the expected cumulative reward obtained through the interaction of the pursuit spacecraft with the environment, in the form of formula (10):

$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ \sum_{t} R(s_t, a_t) \right]$ (10)

where $R(s_t, a_t)$ represents the return value obtained by taking action $a_t$ in state $s_t$, and the goal of policy $\pi^{*}$ is to maximize the expected overall return value; the SAC algorithm belongs to the maximum entropy policy reinforcement learning algorithms, that is, in addition to this basic objective, the entropy of the actions output by the policy is required to be maximal, in the form of formula (11):

$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ \sum_{t} R(s_t, a_t) + \alpha H(\pi(\cdot|s_t)) \right]$ (11)

where $\alpha$ is the entropy coefficient and $H(\pi(\cdot|s_t))$ is the entropy of the policy; requiring maximum entropy mainly serves to randomize the policy, that is, the output probability of each action is balanced as much as possible, so that the pursuit spacecraft explores all possible optimal paths.
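As an illustration of the objective in formula (11), the following PyTorch-style sketch shows an entropy-regularised actor update and a temperature (entropy coefficient) adaptation step; the `policy.sample` and `q_net` interfaces, the target-entropy mechanism and all names are assumptions, not the patent's implementation.

```python
# Hedged sketch of a maximum-entropy (SAC-style) actor loss and an adaptive
# entropy coefficient update; interfaces and names are illustrative assumptions.
import torch

def actor_loss(policy, q_net, states, alpha):
    actions, log_probs = policy.sample(states)   # reparameterised sample and log pi(a|s)
    q_values = q_net(states, actions)
    # maximising E[Q + alpha * H(pi)] is equivalent to minimising E[alpha * log pi - Q]
    return (alpha * log_probs - q_values).mean()

def adapt_alpha(log_alpha, log_probs, target_entropy, optimizer):
    """One gradient step on the temperature so policy entropy tracks target_entropy."""
    loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return log_alpha.exp()   # updated entropy coefficient alpha
```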
5. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein the specific process in step 5 is as follows:
Aiming at the situation where a threat randomly appears within the LOS viewing angle in a complex scene, the pursuit spacecraft is required to make evasion maneuver decisions in real time: in this case, a trajectory controller designed according to the Lyapunov guidance vector field algorithm establishes a vector field model for the threat $T_i$ and designs an evasion trajectory for the pursuit spacecraft, guiding it to fly stably around the threat;
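For reference, a commonly used standoff-circling form of the Lyapunov guidance vector field is sketched below; this standard form is an assumption about the controller, since the patent does not reproduce the field equations here, and the loiter radius would be chosen per the stage-2 description in this claim.

```python
# Hedged sketch of a standard Lyapunov guidance vector field (LGVF) for circling
# a point at a fixed standoff radius; not the patent's exact field.
import numpy as np

def lgvf_desired_velocity(p_pos, threat_pos, loiter_radius, speed):
    """Desired velocity guiding the pursuer onto a circle of loiter_radius around the threat."""
    x, y = p_pos - threat_pos
    r = np.hypot(x, y) + 1e-9
    rd = loiter_radius
    k = speed / (r * (r**2 + rd**2))
    vx = -k * (x * (r**2 - rd**2) + 2.0 * y * r * rd)
    vy = -k * (y * (r**2 - rd**2) - 2.0 * x * r * rd)
    return np.array([vx, vy])   # heading of this vector gives the commanded course
```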
The return function is designed in the form of formula (12):

$r = r_a + r_b + r_c$ (12)
where r represents the total return value and $r_a$ represents the sparse return function, specifically the return given when the pursuit spacecraft successfully captures the target and the return given when the pursuit spacecraft is intercepted, in the form of formula (13):

$r_a = \begin{cases} r_{capture}, & \text{if the capture condition of formula (4) is met} \\ r_{intercept}, & \text{if the interception condition of formula (5) is met} \\ 0, & \text{otherwise} \end{cases}$ (13)

where $r_{capture}$ and $r_{intercept}$ denote the return given upon successful capture and the return given upon interception, respectively;
$r_b$ is the first of the designed guided reward components: throughout the whole process, the included angle between the velocity direction of the pursuit spacecraft and the target line of sight LOS to the escape spacecraft is calculated at each iteration step, scaled by the weight $\mu_1$ to control its magnitude, and added to the total return value, in the form of formula (14):

$r_b = -\mu_1 \alpha_{PE}$ (14)
r c the designed guiding type rewarding return component is two, the process of the pursuing spacecraft flying to the escape spacecraft is divided into two stages, and the form is shown as a formula (15):
Figure FDA0004093638100000053
the stage 1 refers to that the pursuing spacecraft does not sense the existence of threat, namely, when the pursuing spacecraft is in a first-stage task, the pursuing spacecraft continuously moves towards the escaping spacecraft under the driving of a designed autonomous motion planning model, and r is as follows c The guiding function is not exerted; stage 2 refers to when a threat occurs in a path where a vector of the pursuit spacecraft points to the escape spacecraft, that is, the pursuit spacecraft is in a second stage task, a vector field model is built for the threat T according to a lyapunov guidance vector field algorithm, and the radius of the vector field is designed to be d PT +w P, wherein wP The method refers to the pursuit of the fuselage width of the spacecraft, and threat level is expressed in an exponential function form; at this time r c Guiding the distance between the control of the pursuit spacecraft and the threat, otherwise, generating a negative return value through the weight mu 2 And controlling the report value dimension and adding the result to the total report value as punishment.
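Putting formulas (12) through (15) together, a hedged sketch of the composite return is shown below; the reward magnitudes, the weights μ1 and μ2, and the exact exponential form of the stage-2 penalty are illustrative assumptions, since formulas (13) and (15) are given only as images in the original.

```python
# Hedged sketch of the composite return r = r_a + r_b + r_c; all numeric values
# (reward magnitudes, weights, exponential scale) are illustrative assumptions.
import numpy as np

def total_reward(d_PE, d_PT, alpha_PE, R_P, L, stage,
                 mu1=0.01, mu2=0.1, r_capture=10.0, r_intercept=-10.0):
    d_PT = np.atleast_1d(d_PT)
    # r_a: sparse return for capture / interception, mirroring formulas (4) and (5)
    if d_PE <= R_P:
        r_a = r_capture
    elif np.any(d_PT <= L):
        r_a = r_intercept
    else:
        r_a = 0.0
    # r_b: guided return penalising misalignment with the LOS to the target (formula (14))
    r_b = -mu1 * alpha_PE
    # r_c: two-stage guided return; inactive in stage 1, exponential threat penalty in stage 2
    if stage == 1:
        r_c = 0.0
    else:
        r_c = -mu2 * float(np.exp(-np.min(d_PT) / np.max(np.atleast_1d(L))))
    return r_a + r_b + r_c
```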
6. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 1, wherein in step 6 the combination optimization method is applied to the established spacecraft pursuit task scene and the autonomous motion planning model of the pursuit spacecraft is trained, specifically:
step 6.1: loading the spacecraft pursuit task into the deep reinforcement learning algorithm, specifically:
In the established spacecraft pursuit task scene, the process in which the pursuit spacecraft, in the current state $s_t$, takes action $a_t$ and transitions to the next state $s_{t+1}$ is regarded as the Markov process with action and reward introduced in reinforcement learning; it is described by the five-tuple $(S, A, \mathcal{T}, p, \gamma)$, where S is the state space of the process, i.e. State, A is the action set, i.e. Action, $\mathcal{T}$ is the set of time steps, p is the state transition probability function, and $\gamma$ is the return function at each state step;
At each decision time t, a state transition probability matrix $p_t$ is obtained within the finite action space, as shown in formula (16):

$p_t = [p(s'|s, a_1)\ \dots\ p(s'|s, a_N)]$ (16)

where s is the state of the pursuit spacecraft at the current moment, and executing an action from $\{a_1, \dots, a_N\}$ leads to a new state s'; $a_i$ is the i-th action in the action space, N is the total number of actions in the action space, and $p(s'|s, a_i)$ represents the probability that the pursuit spacecraft reaches the new state s' after executing action $a_i$; the return matrix $R_i(s, a)$ generated by the interaction of the pursuit spacecraft with the environment and the total return value function $R(s, a)$ formed from it are shown in formula (17) and formula (18):

$R_i(s, a) = \gamma(s_i, a_i)$ (17)

$R(s, a) = \sum_{t} \gamma(s_t, a_t)$ (18)

Combining the total return function with the maximum entropy strategy of claim 4 yields the SAC algorithm optimization objective shown in formula (11);
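A one-line illustration of formulas (17) and (18): summing the per-step return function γ over a collected trajectory gives the total return; the `gamma_fn` interface is an assumed stand-in for the return function named γ in the five-tuple.

```python
# Minimal illustration of accumulating the total return of formula (18);
# gamma_fn(s, a) stands for the per-step return function gamma and is an assumption.
def episode_return(trajectory, gamma_fn):
    """trajectory: list of (state, action) pairs visited by the pursuit spacecraft."""
    return sum(gamma_fn(s, a) for s, a in trajectory)
```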
step 6.2: training the autonomous motion planning model of the pursuit spacecraft, specifically:
By adjusting the parameters of the network, the autonomous motion planning model is perfected and the actions of the pursuit spacecraft in different states are guided; by adjusting the learning rate, the training time of the model is shortened and the training efficiency of the model is maximized; by adjusting the weight of each return term in the return function, the optimization process of the pursuit spacecraft in the huge action space is guided; by adjusting the entropy coefficient, the learning efficiency of the algorithm and the tracking precision of the pursuit spacecraft are improved; through these adjustments, the total return value of the algorithm steadily converges to a constant as the number of training iterations increases.
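An outline of what such an offline training loop might look like is sketched below; the env/agent interfaces and every hyper-parameter are illustrative placeholders rather than the patent's implementation, and the returned episode-return curve is what the tuning described above would monitor for convergence.

```python
# Hedged outline of the offline training loop of step 6.2; env and agent are assumed
# interfaces (reset/step and act/update with a replay buffer), not the patent's code.
def train(env, agent, episodes=2000, max_steps=500):
    returns = []
    for ep in range(episodes):
        state, total = env.reset(), 0.0
        for _ in range(max_steps):
            action = agent.act(state)                    # angular-velocity command omega
            next_state, reward, done = env.step(action)  # reward is the composite r of formula (12)
            agent.buffer.add(state, action, reward, next_state, done)
            agent.update()   # SAC gradient steps: actor, critics and entropy coefficient
            state, total = next_state, total + reward
            if done:
                break
        returns.append(total)   # convergence of this curve guides the parameter tuning above
    return returns
```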
7. The spacecraft pursuit task combination optimization control method based on SAC and LGVF according to claim 6, wherein step 7 specifically comprises the following steps:
Firstly, the parameters of the SAC algorithm network model and the structure of the autonomous motion planning architecture are adjusted according to the feedback results, adapting them to the spacecraft pursuit task scene and perfecting the autonomous motion planning architecture of step 4;
Secondly, the LGVF model parameters of step 5 are adjusted according to the feedback results, so that the threat-avoidance trajectory designed for the pursuit spacecraft by the vector field is more stable and better suited to the spacecraft pursuit task scene;
Finally, the parameters of the autonomous motion planning model of step 6.2 are adjusted according to the feedback results, improving the model training efficiency and the tracking precision of the pursuit spacecraft.
The combination optimization method is thus perfected through these algorithm feedback optimization means.
CN202310159415.2A 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF Pending CN116107213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310159415.2A CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310159415.2A CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Publications (1)

Publication Number Publication Date
CN116107213A true CN116107213A (en) 2023-05-12

Family

ID=86258005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310159415.2A Pending CN116107213A (en) 2023-02-23 2023-02-23 Spacecraft pursuit task combination optimization control method based on SAC and LGVF

Country Status (1)

Country Link
CN (1) CN116107213A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350326A (en) * 2023-11-29 2024-01-05 北京航空航天大学 Multi-machine trapping method and device for hierarchical collaborative learning, electronic equipment and medium
CN117350326B (en) * 2023-11-29 2024-04-09 北京航空航天大学 Multi-machine trapping method and device for hierarchical collaborative learning, electronic equipment and medium
CN117434968A (en) * 2023-12-19 2024-01-23 华中科技大学 Multi-unmanned aerial vehicle escape-tracking game method and system based on distributed A2C
CN117434968B (en) * 2023-12-19 2024-03-19 华中科技大学 Multi-unmanned aerial vehicle escape-tracking game method and system based on distributed A2C

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination