CN116956987B - Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization - Google Patents


Info

Publication number
CN116956987B
Authority
CN
China
Prior art keywords
particle
optimization
sub
optimizing
reinforcement learning
Prior art date
Legal status: Active
Application number
CN202310946041.9A
Other languages
Chinese (zh)
Other versions
CN116956987A (en)
Inventor
周宏宇
刘芳
方艺忠
刘佳琪
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310946041.9A
Publication of CN116956987A
Application granted
Publication of CN116956987B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N20/00 Machine learning

Abstract

An online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization, belonging to the field of aircraft trajectory optimization and optimal control. The method addresses the low efficiency and low precision of existing online trajectory optimization methods. The invention analyzes the search mechanism of the particle swarm optimization method and uses a reinforcement learning agent to actively control the movement trend of the particles, so that the agent can autonomously determine the search direction according to the progress of the nonlinear optimization problem. This greatly improves the search performance of the particle swarm optimization method and the precision of the online trajectory optimization method, while the introduction of reinforcement learning significantly improves its efficiency. The method can be applied to the online trajectory optimization of sub-orbital hypersonic vehicles.

Description

Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization
Technical Field
The invention belongs to the field of aircraft trajectory optimization and optimal control, and specifically relates to an online trajectory optimization method for a sub-orbital hypersonic vehicle.
Background
The reusable sub-orbital hypersonic vehicle is one of the important development directions of aerospace science and technology and a key focus in the future development of global rapid-arrival capability. A sub-orbital hypersonic vehicle can take off from and land on a runway horizontally like an airplane and can transition freely between near space and outer space, offering higher economy, flexibility, and adaptability than existing transportation systems.
Under state disturbances, environmental deviations, mission changes, and similar conditions, the optimal trajectory of the sub-orbital hypersonic vehicle must be planned online. Existing planning work mainly considers traditional trajectory optimization constraints with analytical expressions, whereas the sub-orbital hypersonic vehicle exhibits large cross-domain, fast time-varying motion characteristics. Complex nonlinear coupling factors exist among its multimode combined propulsion, high lift-to-drag aerodynamic shape, and the uncertain near-space environment, spanning propulsion, aerodynamics, structure, load, and the thermal environment; the corresponding multi-field coupling models and constraint models cannot be established analytically, i.e., a strongly coupled nonlinear trajectory optimization model cannot be written in closed form, which places new demands on trajectory optimization methods. For such complex, strongly coupled nonlinear models, existing online trajectory optimization methods suffer from low efficiency and low precision. Meanwhile, existing hybrid optimization methods mainly adopt a strategy of initial-value search followed by fast convergence, e.g., generating initial values with Particle Swarm Optimization (PSO) or genetic algorithms and then solving with a pseudospectral or convex optimization method. Such a strategy does not fundamentally remove the shortcomings of each individual method and only partially combines their advantages.
Disclosure of Invention
The invention aims to solve the problems of low efficiency and low precision of existing online trajectory optimization methods, and provides an online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization.
The technical solution adopted by the invention to solve the above technical problem is as follows:
The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization specifically comprises the following steps:
Step 1: set the values of the learning factors c_1 and c_2 and of the random numbers η_1, η_2, and η_3;
set the upper limit of training rounds of the reinforcement learning agent to M, and the upper limit of particle swarm evolution generations within a single round to m;
Step 2: train the reinforcement learning agent;
Step 21: initialize the position and velocity of each particle in the swarm, and update the particle positions according to the initialized positions and velocities;
Step 22: set the round counter to 1;
Step 23: set the evolution generation counter p = 1;
Step 24: compute the agent state from the position of each particle in the swarm;
Step 25: let the agent randomly explore actions;
Step 26: update the position of each particle in the swarm using the inertia weight l_p contained in the agent's action;
Step 27: compute the agent reward from the updated particle positions; increase the generation counter p by 1 and judge whether it is less than or equal to m;
if yes, return to step 24;
otherwise, go to step 28;
Step 28: increase the round counter by 1 and judge whether it is less than or equal to M;
if yes, return to step 23;
if not, end training;
Step 3: optimize the trajectory of the sub-orbital hypersonic vehicle online using the trained reinforcement learning agent.
Further, the learning factors are c_1 = 1.5 and c_2 = 0.5, the random numbers satisfy η_1 ∈ [0.5, 1], η_2 ∈ [0.5, 1], and η_3 ∈ [0, 1], M = 2000, and m = 30.
Further, the specific process of step 24 is as follows:
M_pso particles search over D_pso optimization parameters. The position vector of the i-th particle in the search space is x_i = [x_{i,1}, x_{i,2}, ..., x_{i,D_pso}]^T, whose components are the i-th particle's search results for the 1st, 2nd, ..., D_pso-th optimization parameters, and its velocity vector is v_i = [v_{i,1}, v_{i,2}, ..., v_{i,D_pso}]^T, whose components are the speeds at which the i-th particle moves when searching the corresponding optimization parameters; the superscript T denotes transpose, i = 1, 2, ..., M_pso, j = 1, 2, ..., D_pso.
The historical best position vector found by all particles in the whole swarm is p_g = [p_{g,1}, p_{g,2}, ..., p_{g,D_pso}]^T, where p_{g,j} is its search result for the j-th optimization parameter; the historical best position vector found by the i-th particle is p_i = [p_{i,1}, p_{i,2}, ..., p_{i,D_pso}]^T, where p_{i,j} is that particle's best search result for the j-th optimization parameter.
All particles evolve according to formula (1):
v_{i,j}^{t_p} = l_p v_{i,j}^{t_p-1} + c_1 η_1 (p_{i,j} - x_{i,j}^{t_p-1}) + c_2 η_2 (p_{g,j} - x_{i,j}^{t_p-1}),  x_{i,j}^{t_p} = x_{i,j}^{t_p-1} + v_{i,j}^{t_p}   (1)
where t_p is the current evolution generation; v_{i,j}^{t_p} and v_{i,j}^{t_p-1} are the speeds at which the i-th particle moves when searching the j-th optimization parameter at generations t_p and t_p-1; l_p is the inertia weight; p_{i,j} is the historical best position of the j-th optimization parameter found by the i-th particle; p_{g,j} is the historical best position of the j-th optimization parameter found by all particles; and x_{i,j}^{t_p} and x_{i,j}^{t_p-1} are the i-th particle's search results for the j-th optimization parameter at generations t_p and t_p-1.
Introducing the random number η_3, formula (1) is rewritten as formula (2), in which x_{i,j}^{t_p+1} denotes the i-th particle's search result for the j-th optimization parameter at generation t_p+1.
From formula (2) a second-order control system is obtained (formula (3)), where ω_p is the frequency and ξ_p the damping of the second-order control system.
The convergence time t_system of the particles is
t_system = -ln(ε_p)/(ξ_p ω_p)
where ln denotes the natural logarithm and ε_p is a set threshold.
The agent state s is defined as
s = [G_best, G'_best, avg(G_i), δ_swarm]^T
where G_best is the performance index corresponding to p_g at the current evolution generation, G'_best is the performance index corresponding to p_g at the previous generation, G_i is the performance index corresponding to p_i at the current generation, avg(·) denotes averaging, and δ_swarm is the population diversity.
Further, the inertia weight l_p takes values in (0, 1].
Further, the threshold ε_p is set to 0.02.
Further, the population diversity is calculated as
δ_swarm = (1/M_pso) Σ_{i=1}^{M_pso} √( Σ_{j=1}^{D_pso} (x_{i,j}^{t_p-1} - x̄_j^{t_p-1})² )
where x̄_j^{t_p-1} is the average position of all particles for the j-th optimization parameter in the search space at generation t_p-1.
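As an illustrative sketch (not the patent's verbatim implementation), the population diversity above and the four-component agent state s can be computed as follows; the flat stacking of s and the convention that each particle carries a minimized performance index G_i are assumptions:

```python
import numpy as np

def swarm_diversity(positions):
    # delta_swarm: mean Euclidean distance of the particles from the
    # swarm's average position x_bar (one reading of the formula above)
    mean_pos = positions.mean(axis=0)          # average position per parameter
    return float(np.mean(np.linalg.norm(positions - mean_pos, axis=1)))

def agent_state(g_best, g_best_prev, g_i, positions):
    # s = [G_best, G'_best, avg(G_i), delta_swarm]^T as described in the text
    return np.array([g_best, g_best_prev, float(np.mean(g_i)),
                     swarm_diversity(positions)])

# usage: 3 particles, 5 optimization parameters, per-particle indexes g_i
s = agent_state(0.12, 0.15, np.array([0.20, 0.30, 0.25]), np.random.rand(3, 5))
```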
Further, the convergence state of the particles is
lim_{t_p→∞} E[x_{i,j}^{t_p}] = (c_1 η_1 p + c_2 η_2 g)/(c_1 η_1 + c_2 η_2)
where E denotes mathematical expectation, p_{i,j} = p, and p_{g,j} = g.
Further, step 25 is specifically as follows:
define the action a of the agent as
a = [l_p, D_w]^T   (13)
where the superscript T denotes transpose and D_w = ±1;
when D_w = 1, the agent makes a particle-dispersion decision and assigns the value of the inertia weight l_p according to ξ_p ≤ ε_p;
when D_w = -1, the agent makes a particle-convergence decision and assigns the value of the inertia weight l_p according to ξ_p > ε_p.
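The way the decision D_w constrains the choice of l_p can be illustrated as below; the helper damping_ratio, which maps an inertia weight to the damping ξ_p of the equivalent second-order system of formula (3), is a hypothetical stand-in:

```python
import random

def choose_inertia_weight(d_w, damping_ratio, eps_p=0.02, max_tries=100):
    # d_w = +1: dispersion decision, requires xi_p <= eps_p
    # d_w = -1: convergence decision, requires xi_p >  eps_p
    for _ in range(max_tries):
        l_p = random.uniform(1e-3, 1.0)        # l_p in (0, 1]
        xi_p = damping_ratio(l_p)
        if (d_w == 1 and xi_p <= eps_p) or (d_w == -1 and xi_p > eps_p):
            return l_p
    return 0.5  # fallback if no admissible weight was sampled
```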
further, the specific process of step 27 is as follows:
wherein r represents the rewarding value of the agent, G best Representing p after updating the particle positions in the particle swarm g Corresponding performance indexes.
Further, the specific process of step 3 is as follows:
Step 31: initialize the particle positions and velocities, and update the particle positions according to the initialized positions and velocities;
Step 32: update the control variables of the sub-orbital hypersonic vehicle, namely the angle of attack and the bank angle, according to the updated particle positions;
Step 33: compute the agent state from the updated particle positions;
Step 34: input the agent state computed in step 33 to the trained agent, which outputs an action;
Step 35: update the particle positions according to the inertia weight l_p in the output action;
Step 36: repeat steps 32 to 35 until the sub-orbital hypersonic vehicle reaches the target point specified by the mission.
The beneficial effects of the invention are as follows:
the invention explores the optimizing mechanism of the particle swarm optimization method, utilizes the reinforcement learning agent to actively control the movement trend of particles, so that the agent has the capability of autonomously determining the optimizing direction according to the optimizing process of the nonlinear optimization problem, greatly improves the optimizing performance of the particle swarm optimization method, improves the precision of the online track optimization method, and introduces the reinforcement learning method to obviously improve the efficiency of the online track optimization method.
Drawings
FIG. 1 is an analysis diagram of the search process of the conventional particle swarm optimization method;
FIG. 2 is a schematic diagram of the over-damped, under-damped, and divergent states under different inertia weights;
FIG. 3 shows the variation of the damping, frequency, and convergence time with the inertia weight for different learning factors;
FIG. 4 is a schematic diagram of the training process of the reinforcement learning agent on the swarm optimization state.
Detailed Description
The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization specifically comprises the following steps:
Step 1: set the values of the learning factors c_1 and c_2 and of the random numbers η_1, η_2, and η_3;
set the upper limit of training rounds of the reinforcement learning agent to M, and the upper limit of particle swarm evolution generations within a single round to m;
the reinforcement learning method adopted is Proximal Policy Optimization (PPO);
Step 2: train the reinforcement learning agent, as shown in FIG. 4;
Step 21: initialize the position and velocity of each particle in the swarm, and update the particle positions according to the initialized positions and velocities (i.e., according to formula (1));
Step 22: set the round counter to 1;
Step 23: set the evolution generation counter p = 1;
Step 24: compute the agent state from the position of each particle in the swarm;
Step 25: let the agent randomly explore actions;
Step 26: update the position of each particle in the swarm using the inertia weight l_p in the agent's action (i.e., substitute l_p into formula (2));
Step 27: compute the agent reward from the updated particle positions; increase the generation counter p by 1 and judge whether it is less than or equal to m;
if yes, return to step 24;
otherwise, go to step 28;
Step 28: increase the round counter by 1 and judge whether it is less than or equal to M;
if yes, return to step 23;
if not, end training;
For the current round, after the agent reward of each evolution generation has been computed, the rewards of all generations in the round are summed, and the action selection of the next round is adjusted according to this sum.
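The training flow of steps 21 to 28, including the per-round reward summation, can be sketched as follows; swarm and agent are hypothetical objects standing in for the particle swarm of formulas (1)-(2) and the PPO learner, whose update rule is hidden behind agent.learn:

```python
M_ROUNDS = 2000   # M, upper limit of training rounds
M_GENS = 30       # m, upper limit of evolution generations per round

def train(agent, swarm):
    for round_idx in range(M_ROUNDS):            # steps 22 and 28
        swarm.initialize()                       # step 21: positions and velocities
        round_reward = 0.0
        for t_p in range(M_GENS):                # steps 23 and 27
            s = swarm.agent_state()              # step 24
            a = agent.explore(s)                 # step 25: random exploration
            swarm.update(l_p=a[0], d_w=a[1])     # step 26: move the particles
            r = swarm.reward()                   # step 27: 10 on improvement, else 0
            agent.store(s, a, r)
            round_reward += r
        agent.learn(round_reward)                # adjust the next round's action selection
    return agent
```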
Step 3: optimize the trajectory of the sub-orbital hypersonic vehicle online using the trained reinforcement learning agent.
FIG. 1 is an analysis diagram of the search process of the conventional particle swarm optimization method. The mechanism analysis proposed by the invention identifies the key factors that influence the search process and the movement trend of the swarm, points out the correct direction for improving the method, and addresses the problems of limited improvement and insufficient integration of existing approaches. By exploiting the environment-exploration mechanism of reinforcement learning, the agent gains the ability to autonomously determine the search direction according to the progress of the nonlinear optimization problem, which improves the solution quality and efficiency of the optimization problem and supports the online planning and safe flight of the reusable sub-orbital vehicle. Combining the particle swarm optimization method with reinforcement learning also avoids the high data demand, poor generalization, and poor scenario applicability of standalone Reinforcement Learning (RL). The method can likewise solve other complex nonlinear optimization problems.
The second embodiment: this embodiment differs from the first embodiment in that the learning factors are c_1 = 1.5 and c_2 = 0.5, the random numbers satisfy η_1 ∈ [0.5, 1], η_2 ∈ [0.5, 1], and η_3 ∈ [0, 1], M = 2000, and m = 30.
Other steps and parameters are the same as in the first embodiment.
The third embodiment: this embodiment differs from the first or second embodiment in that the specific process of step 24 is as follows:
M_pso particles search over D_pso optimization parameters. The D_pso optimization parameters are the design variables of the function used to compute the angle of attack and the bank angle; substituting the optimized parameters into that function yields the angle of attack and the bank angle. The position vector of the i-th particle in the search space is x_i = [x_{i,1}, x_{i,2}, ..., x_{i,D_pso}]^T, whose components are the i-th particle's search results for the 1st, 2nd, ..., D_pso-th optimization parameters, and its velocity vector is v_i = [v_{i,1}, v_{i,2}, ..., v_{i,D_pso}]^T, whose components are the speeds at which the i-th particle moves when searching the corresponding optimization parameters; the superscript T denotes transpose, i = 1, 2, ..., M_pso, j = 1, 2, ..., D_pso.
The historical best position vector found by all particles in the whole swarm is p_g = [p_{g,1}, p_{g,2}, ..., p_{g,D_pso}]^T, where p_{g,j} is its search result for the j-th optimization parameter; the historical best position vector found by the i-th particle is p_i = [p_{i,1}, p_{i,2}, ..., p_{i,D_pso}]^T, where p_{i,j} is that particle's best search result for the j-th optimization parameter.
All particles evolve according to formula (1):
v_{i,j}^{t_p} = l_p v_{i,j}^{t_p-1} + c_1 η_1 (p_{i,j} - x_{i,j}^{t_p-1}) + c_2 η_2 (p_{g,j} - x_{i,j}^{t_p-1}),  x_{i,j}^{t_p} = x_{i,j}^{t_p-1} + v_{i,j}^{t_p}   (1)
where t_p is the current evolution generation, with initial value 1; v_{i,j}^{t_p} and v_{i,j}^{t_p-1} are the speeds at which the i-th particle moves when searching the j-th optimization parameter at generations t_p and t_p-1; l_p is the inertia weight; p_{i,j} is the historical best position of the j-th optimization parameter found by the i-th particle; p_{g,j} is the historical best position of the j-th optimization parameter found by all particles; and x_{i,j}^{t_p} and x_{i,j}^{t_p-1} are the i-th particle's search results for the j-th optimization parameter at generations t_p and t_p-1.
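A vectorized sketch of the evolution rule of formula (1) under the stated random-number ranges; the array shapes and the NumPy representation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, p_i, p_g, l_p, c1=1.5, c2=0.5):
    # x, v: (M_pso, D_pso) positions and velocities at generation t_p - 1
    # p_i: (M_pso, D_pso) per-particle historical best; p_g: (D_pso,) global best
    eta1 = rng.uniform(0.5, 1.0, size=x.shape)   # eta_1 in [0.5, 1]
    eta2 = rng.uniform(0.5, 1.0, size=x.shape)   # eta_2 in [0.5, 1]
    v_new = l_p * v + c1 * eta1 * (p_i - x) + c2 * eta2 * (p_g - x)
    return x + v_new, v_new                      # positions and velocities at t_p
```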
Introducing the random number η_3, formula (1) is rewritten as formula (2), in which x_{i,j}^{t_p+1} denotes the i-th particle's search result for the j-th optimization parameter at generation t_p+1.
From formula (2) a typical second-order control system is obtained (formula (3)), where ω_p is the frequency and ξ_p the damping of the second-order control system.
Define r_4 = c_1 η_1 + c_2 η_2 - 1; then the particles converge when r_4 > l_p and diverge when r_4 < l_p.
Consider the following conventional particle swarm optimization setting: η_1, η_2 ∈ [0, 1] and c_1 = c_2 = 2. With r_4 = c_1 η_1 + c_2 η_2 - 1, the range of r_4 is [-1, 3], and there are two cases in which the decision action a of the agent is invalidated:
① r_4 < 0: whatever value l_p takes, the system diverges (l_p > 0 > r_4);
② r_4 > 1: whatever value l_p takes, the system converges (l_p < 1 < r_4). In both cases, no decision action a made by the agent can affect the size relationship between r_4 and l_p.
Therefore, for 0 < l_p < 1 one should ensure 0 < r_4 < 1. Based on this analysis, the invention sets η_1, η_2 ∈ [0.5, 1] and c_1 = c_2 = 1; compared with the conventional setting, the mean of η_1 and η_2 rises to 0.75 while c_1 and c_2 are reduced to 1.0.
Further, consider the over-damping condition ξ_p > 1 of the system. From formula (3), this condition can be obtained as formula (4) and rearranged into formula (5), i.e., the requirement y(l_p) > 0 for the upward-opening univariate quadratic function
y(l_p) = l_p² - (2r_4 + 4)l_p + (r_4² - 4c_2η_3)
The root discriminant Δy of y(l_p) = 0 is
Δy = (2r_4 + 4)² - 4(r_4² - 4c_2η_3) = 16 + 16r_4 + 16c_2η_3   (6)
It can be seen that Δy > 0, so y(l_p) has two zeros, denoted l_p0.
Considering l_p0 < 1 eliminates the larger zero, giving l_p0 = r_4 + 2 - 2√(1 + r_4 + c_2η_3); considering 0 < l_p0 in addition yields c_2η_3 < r_4²/4. The system is therefore over-damped, i.e., ξ_p > 1, when l_p < l_p0.
Since c_2η_3 ∈ [0, 1] when c_2 = 1 and η_3 ∈ [0, 1], the probability of c_2η_3 < r_4²/4 is low, i.e., over-damping can occur only when η_3 takes small values. Accordingly, even if the agent wishes to move the particles in an over-damped manner, it is likely that no admissible inertia weight exists, and the decision action a of the agent is invalidated.
To improve the effectiveness of the agent's decisions, the invention controls the probability of over-damping to about 10% and further refines the settings of the particle swarm optimization method: taking c_1 = 1.5, c_2 = 0.5, η_1, η_2 ∈ [0.5, 1], and η_3 ∈ [0, 1], so that c_2η_3 ∈ [0, 0.5] and the probability of the system becoming over-damped increases.
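The effect of this refined setting on the over-damping condition c_2η_3 < r_4²/4 derived above can be checked numerically; the following Monte Carlo sketch is illustrative and assumes uniform sampling over the stated ranges:

```python
import numpy as np

def overdamping_possible_prob(c1, c2, n=200_000, seed=1):
    # probability that c2*eta3 < r4**2 / 4, i.e. that an over-damped
    # inertia-weight choice exists; eta_1, eta_2 ~ U[0.5, 1], eta_3 ~ U[0, 1]
    rng = np.random.default_rng(seed)
    r4 = c1 * rng.uniform(0.5, 1.0, n) + c2 * rng.uniform(0.5, 1.0, n) - 1.0
    return float(np.mean(c2 * rng.uniform(0.0, 1.0, n) < r4**2 / 4.0))

print(overdamping_possible_prob(1.0, 1.0))   # c1 = c2 = 1: small probability
print(overdamping_possible_prob(1.5, 0.5))   # refined setting: on the order of 10%
```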
As shown in FIG. 3, analyzing the relationship between the parameter settings of the particle swarm optimization method and the particle evolution makes the improvement of the method well targeted.
If the second-order control system is regarded as having converged when the steady-state error reaches 2%, the convergence time t_system of the particles is
t_system = -ln(ε_p)/(ξ_p ω_p)   (8)
where t_system is the particle convergence time. As can be seen from formulas (3) and (8), increasing l_p increases the convergence time and the system frequency while decreasing the system damping. Formula (8) does not cover ξ_p < 0: theoretically t_system tends to infinity when the system diverges, but since the particles eventually return to a converging state and cannot diverge permanently, divergence is regarded as merely prolonging the convergence time. Taking ξ_p ≤ ε_p as the divergence condition, the convergence time of a divergent system is obtained by rewriting formula (8) with ξ_p replaced by ε_p (formula (9)), where ln denotes the natural logarithm and ε_p is the set threshold.
When ε_p < ξ_p < 1, the second-order control system is under-damped; when ξ_p ≥ 1, it is over-damped; and when ξ_p ≤ ε_p, it is in a divergent state.
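The damping-state classification and the settling-time estimate can be expressed compactly as below; the closed form used for t_system is the 2% settling-time rule reconstructed above, so treat it as an assumption rather than the patent's exact expression:

```python
import math

def damping_state(xi_p, eps_p=0.02):
    if xi_p <= eps_p:
        return "divergent"        # formula (9) case
    return "overdamped" if xi_p >= 1.0 else "underdamped"

def convergence_time(xi_p, omega_p, eps_p=0.02):
    # settling time to the eps_p (2%) band; a divergent system is treated
    # as if xi_p = eps_p, so divergence only prolongs the convergence time
    xi_eff = max(xi_p, eps_p)
    return -math.log(eps_p) / (xi_eff * omega_p)
```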
The agent state s is defined as
s = [G_best, G'_best, avg(G_i), δ_swarm]^T
where G_best is the performance index corresponding to p_g at the current evolution generation, G'_best is the performance index corresponding to p_g at the previous generation, G_i is the performance index corresponding to p_i at the current generation, avg(·) denotes averaging, and δ_swarm is the population diversity.
The performance index can be set according to the specific problem and requirements; it only needs to be a minimized and dimensionless quantity, for example minimum heat absorption, maximum range, or minimum load, depending on the specific problem and requirements.
Other steps and parameters are the same as in the first or second embodiment.
The fourth embodiment: this embodiment is described with reference to FIG. 2. It differs from embodiments one to three in that the inertia weight l_p takes values in (0, 1].
Other steps and parameters are the same as in one of the first to third embodiments.
The fifth embodiment: this embodiment differs from embodiments one to four in that the threshold ε_p is set to 0.02.
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth embodiment: this embodiment differs from embodiments one to five in that the population diversity is calculated as
δ_swarm = (1/M_pso) Σ_{i=1}^{M_pso} √( Σ_{j=1}^{D_pso} (x_{i,j}^{t_p-1} - x̄_j^{t_p-1})² )
where x̄_j^{t_p-1} is the average position of all particles for the j-th optimization parameter in the search space at generation t_p-1.
Other steps and parameters are the same as in one of the first to fifth embodiments.
The seventh embodiment: this embodiment differs from embodiments one to six in that when the particle swarm optimization method converges, the convergence state of the particles is
lim_{t_p→∞} E[x_{i,j}^{t_p}] = (c_1 η_1 p + c_2 η_2 g)/(c_1 η_1 + c_2 η_2)
where E denotes mathematical expectation, p_{i,j} = p, and p_{g,j} = g.
Other steps and parameters are the same as in one of the first to sixth embodiments.
The eighth embodiment: this embodiment differs from embodiments one to seven in that step 25 is specifically as follows:
define the action a of the agent as
a = [l_p, D_w]^T   (13)
where the superscript T denotes transpose and D_w = ±1;
when D_w = 1, the agent makes a particle-dispersion decision and assigns the value of the inertia weight l_p according to ξ_p ≤ ε_p;
when D_w = -1, the agent makes a particle-convergence decision and assigns the value of the inertia weight l_p according to ξ_p > ε_p.
other steps and parameters are the same as those of one of the first to seventh embodiments.
The ninth embodiment: this embodiment differs from embodiments one to eight in that the specific process of step 27 is as follows: the agent reward is computed by formula (14),
r = 10 if G_best is improved by the update, and r = 0 otherwise   (14)
where r is the reward value of the agent and G_best is the performance index corresponding to p_g after the particle positions in the swarm are updated.
Other steps and parameters are the same as in one of the first to eighth embodiments.
Formula (14) states that when the performance index improves, i.e., the particle swarm finds a better position, the agent obtains a reward of 10 at this step; otherwise it obtains no reward. The goal in designing the reinforcement learning-particle swarm hybrid optimization method is to optimize the D_pso parameters with as few particles and evolution generations as possible; the agent therefore takes a fixed number of steps per round, i.e., the round ends when t_p = t_p,max.
The tenth embodiment: this embodiment differs from embodiments one to nine in that the specific process of step 3 is as follows:
Step 31: initialize the particle positions and velocities, and update the particle positions according to the initialized positions and velocities (i.e., according to formula (1));
Step 32: update the control variables of the sub-orbital hypersonic vehicle, namely the angle of attack and the bank angle, according to the updated particle positions;
Step 33: compute the agent state from the updated particle positions;
Step 34: input the agent state computed in step 33 to the trained agent, which outputs an action;
Step 35: update the particle positions according to the inertia weight l_p in the output action;
Step 36: repeat steps 32 to 35 until the sub-orbital hypersonic vehicle reaches the target point specified by the mission.
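Step 3 then reduces to the following online loop; the vehicle, swarm, and agent interfaces (best_control_profile, apply_controls, target_reached) are hypothetical placeholders for the guidance and simulation layers:

```python
def optimize_online(agent, swarm, vehicle):
    swarm.initialize()                               # step 31
    while not vehicle.target_reached():              # step 36: stop at the target point
        alpha, sigma = swarm.best_control_profile()  # step 32: angle of attack, bank angle
        vehicle.apply_controls(alpha, sigma)
        s = swarm.agent_state()                      # step 33
        a = agent.act(s)                             # step 34: trained policy, no exploration
        swarm.update(l_p=a[0], d_w=a[1])             # step 35
```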
In this way, the control variables of the sub-orbital hypersonic vehicle are updated according to the updated particle positions, thereby controlling the vehicle.
Other steps and parameters are the same as in one of the first to ninth embodiments.
The above examples of the present invention only describe its computational model and computational flow in detail and do not limit its embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art; it is not possible to enumerate all embodiments, and all obvious variations and modifications belonging to the technical solution of the invention remain within its scope.

Claims (9)

1. An online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization, characterized by comprising the following steps:
Step 1: set the values of the learning factors c_1 and c_2 and of the random numbers η_1, η_2, and η_3;
set the upper limit of training rounds of the reinforcement learning agent to M, and the upper limit of particle swarm evolution generations within a single round to m;
Step 2: train the reinforcement learning agent;
Step 21: initialize the position and velocity of each particle in the swarm, and update the particle positions according to the initialized positions and velocities;
Step 22: set the round counter to 1;
Step 23: set the evolution generation counter p = 1;
Step 24: compute the agent state from the position of each particle in the swarm;
Step 25: let the agent randomly explore actions;
Step 26: update the position of each particle in the swarm using the inertia weight l_p contained in the agent's action;
Step 27: compute the agent reward from the updated particle positions; increase the generation counter p by 1 and judge whether it is less than or equal to m;
if yes, return to step 24;
otherwise, go to step 28;
Step 28: increase the round counter by 1 and judge whether it is less than or equal to M;
if yes, return to step 23;
if not, end training;
Step 3: optimize the trajectory of the sub-orbital hypersonic vehicle online using the trained reinforcement learning agent;
the specific process of step 3 is as follows:
Step 31: initialize the particle positions and velocities, and update the particle positions according to the initialized positions and velocities;
Step 32: update the control variables of the sub-orbital hypersonic vehicle, namely the angle of attack and the bank angle, according to the updated particle positions;
Step 33: compute the agent state from the updated particle positions;
Step 34: input the agent state computed in step 33 to the trained agent, which outputs an action;
Step 35: update the particle positions according to the inertia weight l_p in the output action;
Step 36: repeat steps 32 to 35 until the sub-orbital hypersonic vehicle reaches the target point specified by the mission.
2. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 1, characterized in that the learning factors are c_1 = 1.5 and c_2 = 0.5, the random numbers satisfy η_1 ∈ [0.5, 1], η_2 ∈ [0.5, 1], and η_3 ∈ [0, 1], M = 2000, and m = 30.
3. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 2, characterized in that the specific process of step 24 is as follows:
M_pso particles search over D_pso optimization parameters; the position vector of the i-th particle in the search space is x_i = [x_{i,1}, x_{i,2}, ..., x_{i,D_pso}]^T, whose components are the i-th particle's search results for the 1st, 2nd, ..., D_pso-th optimization parameters, and its velocity vector is v_i = [v_{i,1}, v_{i,2}, ..., v_{i,D_pso}]^T, whose components are the speeds at which the i-th particle moves when searching the corresponding optimization parameters; the superscript T denotes transpose, i = 1, 2, ..., M_pso, j = 1, 2, ..., D_pso;
the historical best position vector found by all particles in the whole swarm is p_g = [p_{g,1}, p_{g,2}, ..., p_{g,D_pso}]^T, where p_{g,j} is its search result for the j-th optimization parameter; the historical best position vector found by the i-th particle is p_i = [p_{i,1}, p_{i,2}, ..., p_{i,D_pso}]^T, where p_{i,j} is that particle's best search result for the j-th optimization parameter;
all particles evolve according to formula (1):
v_{i,j}^{t_p} = l_p v_{i,j}^{t_p-1} + c_1 η_1 (p_{i,j} - x_{i,j}^{t_p-1}) + c_2 η_2 (p_{g,j} - x_{i,j}^{t_p-1}),  x_{i,j}^{t_p} = x_{i,j}^{t_p-1} + v_{i,j}^{t_p}   (1)
where t_p is the current evolution generation; v_{i,j}^{t_p} and v_{i,j}^{t_p-1} are the speeds at which the i-th particle moves when searching the j-th optimization parameter at generations t_p and t_p-1; l_p is the inertia weight; p_{i,j} is the historical best position of the j-th optimization parameter found by the i-th particle; p_{g,j} is the historical best position of the j-th optimization parameter found by all particles; and x_{i,j}^{t_p} and x_{i,j}^{t_p-1} are the i-th particle's search results for the j-th optimization parameter at generations t_p and t_p-1;
introducing the random number η_3, formula (1) is rewritten as formula (2), in which x_{i,j}^{t_p+1} denotes the i-th particle's search result for the j-th optimization parameter at generation t_p+1;
from formula (2) a second-order control system is obtained (formula (3)), where ω_p is the frequency and ξ_p the damping of the second-order control system;
the convergence time t_system of the particles is t_system = -ln(ε_p)/(ξ_p ω_p), where ln denotes the natural logarithm and ε_p is a set threshold;
the agent state s is defined as s = [G_best, G'_best, avg(G_i), δ_swarm]^T, where G_best is the performance index corresponding to p_g at the current evolution generation, G'_best is the performance index corresponding to p_g at the previous generation, G_i is the performance index corresponding to p_i at the current generation, avg(·) denotes averaging, and δ_swarm is the population diversity.
4. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 3, characterized in that the inertia weight l_p takes values in (0, 1].
5. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 4, characterized in that the threshold ε_p is set to 0.02.
6. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 5, characterized in that the population diversity is calculated as
δ_swarm = (1/M_pso) Σ_{i=1}^{M_pso} √( Σ_{j=1}^{D_pso} (x_{i,j}^{t_p-1} - x̄_j^{t_p-1})² )
where x̄_j^{t_p-1} is the average position of all particles for the j-th optimization parameter in the search space at generation t_p-1.
7. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 6, characterized in that the convergence state of the particles is
lim_{t_p→∞} E[x_{i,j}^{t_p}] = (c_1 η_1 p + c_2 η_2 g)/(c_1 η_1 + c_2 η_2)
where E denotes mathematical expectation, p_{i,j} = p, and p_{g,j} = g.
8. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 7, characterized in that step 25 is specifically as follows:
define the action a of the agent as
a = [l_p, D_w]^T   (13)
where the superscript T denotes transpose and D_w = ±1;
when D_w = 1, the agent makes a particle-dispersion decision and assigns the value of the inertia weight l_p according to ξ_p ≤ ε_p;
when D_w = -1, the agent makes a particle-convergence decision and assigns the value of the inertia weight l_p according to ξ_p > ε_p.
9. The online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization according to claim 8, characterized in that the specific process of step 27 is as follows: the agent reward is computed as
r = 10 if G_best is improved by the update, and r = 0 otherwise   (14)
where r is the reward value of the agent and G_best is the performance index corresponding to p_g after the particle positions in the swarm are updated.
CN202310946041.9A 2023-07-28 2023-07-28 Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization Active CN116956987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310946041.9A CN116956987B (en) 2023-07-28 2023-07-28 Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310946041.9A CN116956987B (en) 2023-07-28 2023-07-28 Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization

Publications (2)

Publication Number Publication Date
CN116956987A (en) 2023-10-27
CN116956987B (en) 2024-03-26

Family

ID=88446069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310946041.9A Active CN116956987B (en) 2023-07-28 2023-07-28 Online trajectory optimization method for a sub-orbital hypersonic vehicle based on reinforcement learning-particle swarm hybrid optimization

Country Status (1)

Country Link
CN (1) CN116956987B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114413906B * 2022-01-18 2022-12-13 Harbin Institute of Technology Three-dimensional trajectory planning method based on improved particle swarm optimization algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398834A (en) * 2022-01-18 2022-04-26 中国科学院半导体研究所 Training method of particle swarm optimization algorithm model, particle swarm optimization method and device
CN114444648A (en) * 2022-04-08 2022-05-06 中国人民解放军96901部队 Intelligent optimization method based on reinforcement learning and particle swarm optimization
CN116451737A (en) * 2023-04-25 2023-07-18 上海电力大学 PG-W-PSO method for improving particle swarm based on reinforcement learning strategy gradient

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cooperative trajectory planning for aircraft based on an improved particle swarm algorithm; 周宏宇 et al.; Acta Automatica Sinica (自动化学报); 2022-11-30; Vol. 48, No. 11; pp. 2670-2676 *

Also Published As

Publication number Publication date
CN116956987A (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant