CN112818599A - Air control method based on reinforcement learning and four-dimensional track - Google Patents


Info

Publication number
CN112818599A
CN112818599A (application CN202110134760.1A; granted publication CN112818599B)
Authority
CN
China
Prior art keywords
airplane
point
speed
dimensional
course
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110134760.1A
Other languages
Chinese (zh)
Other versions
CN112818599B (en)
Inventor
俎文强
季玉龙
何扬
黄操
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110134760.1A
Publication of CN112818599A
Application granted
Publication of CN112818599B
Expired - Fee Related
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00: Computer-aided design [CAD]
    • G06F 30/20: Design optimisation, verification or simulation
    • G06F 30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses an air control method based on reinforcement learning and four-dimensional trajectories. The method first establishes aircraft aerodynamic performance models for different aircraft types; collects four-dimensional trajectory data of different aircraft types on different routes according to the aerodynamic performance models; and generates a route-and-type four-dimensional trajectory model through data playback. Finally, a neural network is built on a reinforcement learning algorithm and trained so that the aircraft follows the four-dimensional trajectory, forming a nested reinforcement learning model in which a speed agent is nested inside a heading agent: route selection is achieved by choosing the aircraft's target heading, and arrival-time control is achieved by choosing the aircraft's target speed, so that the aircraft follows the four-dimensional trajectory model at the specified time, speed, heading and altitude. The invention provides a feasible solution to the problems currently faced by airports, such as heavy traffic, complex aircraft scheduling and difficult air traffic control.

Description

Air control method based on reinforcement learning and four-dimensional track
Technical Field
The invention relates to the technical field of intelligent air traffic control, in particular to an air traffic control method based on reinforcement learning and four-dimensional tracks.
Background
A new generation of air traffic control should be intelligent. High-density traffic and large numbers of aircraft pose significant challenges to air traffic controllers (ATCos), who therefore need automated approaches to reduce complexity, particularly during landing (arrival) and takeoff. One straightforward way to automate the air traffic control problem is to have artificial-intelligence ATCos control aircraft to fly along computed 4D trajectories.
The European air traffic authority has identified data-driven trajectory prediction, in particular 4D trajectories that are typically predicted using aircraft aerodynamic performance models, as one of the key pillars of future air traffic management. This underlines the importance of air traffic control methods based on trajectories and aircraft performance models.
Methods based on trajectories or aircraft performance models have been studied extensively in the field of air traffic control. Klomp proposed a conceptual decision-support tool for 4D trajectory management in 2019, aiming to overcome these problems by directly visualizing the solution space of possible actions; the feasibility of the concept was verified through a preliminary evaluation of a partial implementation of the solution-space representation. Jacco et al. proposed the BlueSky project in 2016, which investigated the feasibility of a fully open-source, open-data approach to air traffic simulation; one of its main contributions is high fidelity, e.g. aircraft performance is modeled on the actual aerodynamic performance of the aircraft.
Marc Britain's 2018 research on automated air traffic control proposed a deep reinforcement learning method that uses an air traffic control simulator created by NASA as the environment, provides tactical decision support for air traffic controllers, selects a route and changes the speed of each aircraft, and addresses the sequencing and separation problems of autonomous air traffic control. They designed a nested agent structure in which a master agent takes an action (changing the route) and a nested agent is responsible for speed control, which works around the fact that the problem cannot be posed as a typical single-agent environment because of its non-Markovian nature. The nested agent decouples the action sets for changing routes and changing speeds. Their results show that the reward oscillates frequently throughout training but increases overall. However, their approach is not applicable in all cases. In addition, their study used NASA33 as the simulator, considered only aircraft spawned at fixed locations and moving on a limited set of paths, and did not consider the influence of the aircraft aerodynamic performance package on the flight path. They employed a DQN-based deep nested-agent approach, a value-based reinforcement learning method suited to discrete environments but not to continuous ones.
Vonk explored in 2019 the possibility of applying reinforcement learning techniques to the sequencing and spacing of aircraft in air traffic control. The experiment aimed to learn to navigate to the FAF point and arrive at the correct time, simulating interaction with arriving agents. However, the results were not stable. The limitation of this approach is that the aircraft are trained only with heading instructions at constant speed, regardless of speed factors; the trajectory finally chosen by the AI is unknown, and the direction of arrival cannot be controlled.
As for recent research advances, several researchers have proposed nested approaches to reinforcement learning. Surioyo Ghosh proposed an intelligent air traffic control method based on a multi-agent reinforcement learning algorithm in 2020; its main idea is to train a single master neural network to handle the interaction effects among multiple agents. They identified an effective learning paradigm for multi-agent reinforcement learning, but their main research direction was air traffic collision detection and avoidance. Their methods are not applicable to the field of four-dimensional-trajectory-based air traffic control, because they do not take into account the time constraint for reaching the target, a condition that four-dimensional-trajectory-based air traffic control must consider and rely on.
In summary, the problems of the prior art are as follows:
(1) In the prior art, solving the four-dimensional-trajectory-based air traffic control problem with conventional reinforcement learning methods runs into the sparse-reward problem, and handling sparse rewards is one of the difficulties; in addition, the design of the reward function is a difficult point when training multi-objective agents.
(2) In the prior art, most reinforcement-learning-based algorithms in the air traffic control field resemble aircraft collision-avoidance algorithms. Such algorithms help research in a specific area, but they are not broadly applicable air traffic control methods, whereas an intelligent air traffic control method based on four-dimensional trajectories is one of the fundamental, broadly applicable methods.
(3) In the prior art, air traffic control methods based on reinforcement learning and four-dimensional trajectories have major limitations, such as poor stability, low accuracy and many restrictive conditions. In addition, owing to limits on algorithm accuracy and complexity, multiple factors cannot be considered simultaneously: most methods consider only one influencing factor, such as the aircraft heading angle or the aircraft speed, and therefore do not yet meet the requirements for practical use.
Moreover, for the reward-function design problem of multi-objective agents, hand-designing the reward function leads to the following issues: 1. the reward is abstract and hard to express as a formula; 2. there are many parameters and the design difficulty is high; 3. the resulting reward function performs poorly.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide an air traffic control method based on reinforcement learning and four-dimensional trajectories, which provides a feasible solution to the problems of heavy traffic, complex aircraft scheduling and difficult air traffic control currently faced by airports. The technical scheme is as follows:
an air control method based on reinforcement learning and four-dimensional track comprises the following steps:
S1: establishing aircraft aerodynamic performance models for different aircraft types by modeling the engine performance of each type;
S2: collecting four-dimensional trajectory data of different aircraft types on different routes according to the aircraft aerodynamic performance models, and generating a route-and-type four-dimensional trajectory model through data playback;
S3: building a neural network based on a reinforcement learning algorithm and training the aircraft to follow the four-dimensional trajectory; constructing a nested reinforcement learning model in which a speed agent is nested inside a heading agent, so that route selection is achieved by choosing the aircraft's target heading and arrival-time control is achieved by choosing the aircraft's target speed, whereby the aircraft follows the four-dimensional trajectory model at the specified time, speed, heading and altitude.
Further, the specific process of S1 is as follows: defining key position points carrying aircraft motion-state information; selecting an aircraft of a specific type in a flight simulation system equipped with the aircraft aerodynamic performance model to fly a simulated flight along the route summarized from the specified position points; recording, at fixed time intervals, information including flight time, the six degrees of freedom of the aircraft and environmental factors; and storing the information in a recording file.
Further, the specific process of S2 is as follows:
S21: collecting the track points that meet the conditions to form a track point set G, and mapping each track point onto the route to obtain the set G' of discrete track-point mapping points on the route;
G = {g_i, i = 1, 2, 3, ..., n}    (1)
G' = {g'_i, i = 1, 2, 3, ..., n}    (2)
where g_i is a track point meeting the conditions, g'_i is the mapping point of track point g_i on the route, and n is the number of samples;
S22: calculating the distance s_i from each mapping point g'_i to the start of its leg, obtaining the sample set W' of discrete track-point mapping points on the route in terms of distance and speed;
W' = {(s_i, v_i), i = 1, 2, ..., n}    (3)
where s_i is the distance from the sampling point to the start of the route, and v_i is a one-dimensional output vector denoting the speed of the aircraft at the position a distance s_i from the start of the route;
s23: for the collected sample set W', LSSVM in machine learning is selected, and each sample set is usedDistance xi from sample point to respective hyperplaneiRepresents the empirical risk of LSSVM, and the least empirical risk of training is
Figure BDA0002922997920000031
Minimum, its mathematical model is:
Figure BDA0002922997920000032
wherein w is viAbout siA linear parameter of (d); b is a linear offset;
according to the principle of minimizing the structural risk, the LSSVM needs to ensure the distance maximization of two classification hyperplanes, and the solved mathematical model is a compromise between empirical risk and structural risk, namely
Figure BDA0002922997920000033
Where C is a penalty factor and the distance ξ from a sample point to its hyperplaneiIs a training error;
s33: to solve this optimization problem, Lagrange's function is introduced:
Figure BDA0002922997920000041
wherein alpha isiN is Lagrange multiplier, e is unit vector;
Figure BDA0002922997920000042
representation wsiw/|w|;
The following relationship is obtained from the KKT condition:
Figure BDA0002922997920000043
kernel function
Figure BDA0002922997920000044
sjIs a navigation point mapping point g'jDistance to the starting point of each leg; then the solution form of equation (7) is converted into:
Figure BDA0002922997920000045
wherein Q is an element KijK × k order kernel matrix of (1), I is the identity matrix, and vector e ═ 1, …,1]TThe vector α ═ α1,…,αn]TVector v ═ v1,…,vn]T
Solving formula (8) to obtain alphaiAnd substituting the value of b into the formula (6) to obtain the chaotic time series regression model of the LSSVM, wherein the chaotic time series regression model of the LSSVM is as follows:
Figure BDA0002922997920000046
the speed value of each position point s on the corresponding route is as follows:
Figure BDA0002922997920000047
After the s-v mapping of the route is obtained, the route-and-type four-dimensional trajectory model is derived.
Further, mapping each track point onto the route in S21 comprises:
straight-leg data mapping: drawing a perpendicular from each track point to the straight leg l; the intersection with the leg is the mapping point corresponding to that track point;
arc-leg data mapping: connecting each track point to the centre of the arc leg; the intersection of the resulting line with the arc is the mapping point corresponding to that track point.
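The two mappings can be illustrated with the following sketch in planar x/y coordinates (an illustrative simplification; the coordinate handling and function names are assumptions of this sketch, not part of the method described above):

```python
# Illustrative sketch (not from the patent text): projecting recorded track points
# onto a straight leg and onto an arc leg, using planar x/y coordinates.
import numpy as np

def map_to_straight_leg(p, a, b):
    """Project point p onto the straight leg a->b; return (mapped point, distance from a)."""
    ab = b - a
    t = np.dot(p - a, ab) / np.dot(ab, ab)
    t = np.clip(t, 0.0, 1.0)           # keep the mapping on the leg itself (assumed)
    q = a + t * ab                      # foot of the perpendicular
    return q, np.linalg.norm(q - a)

def map_to_arc_leg(p, center, radius, start):
    """Intersect the ray center->p with the arc; return (mapped point, arc length from start),
    assuming a counter-clockwise arc beginning at the start point."""
    u = (p - center) / np.linalg.norm(p - center)
    q = center + radius * u             # intersection of the ray with the circle
    a0 = np.arctan2(*(start - center)[::-1])
    a1 = np.arctan2(*(q - center)[::-1])
    arc = radius * ((a1 - a0) % (2 * np.pi))
    return q, arc

p = np.array([3.0, 2.0])
q, s = map_to_straight_leg(p, np.array([0.0, 0.0]), np.array([10.0, 0.0]))
print(q, s)   # mapped point on the leg and its distance s_i to the leg start
```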
Further, the S3 specifically includes:
S31: setting up an experimental environment in a simulation system, determining the type of the training aircraft, the aircraft spawn position and the simulation speed, and initializing the environment;
S32: building the reinforcement learning algorithm on the basis of the PPO algorithm:
(1) setting a state space:
there are two agents in the reinforcement learning experiment: a speed agent that selects the speed and a heading agent that changes the heading;
the state space of the heading agent is set as: [Δlat, Δlon, tarhdg, hdg];
where Δlat is the difference between the target latitude and the aircraft latitude, Δlon is the difference between the target longitude and the aircraft longitude, tarhdg denotes the target heading, and hdg denotes the aircraft heading;
the state space of the speed agent is set as: [Δlat, Δlon, tarhdg, hdg, cas, time];
where cas denotes the calibrated airspeed of the aircraft and time denotes the time remaining to reach the target;
(2) setting an action space:
defining the action space of the heading agent: A_t = [0, hdg, 360]
where the minimum heading is 0 degrees, the maximum heading is 360 degrees, and the action space is a distribution from 0 to 360;
defining the action space of the speed agent: A_t = [v_min, v_(t-1), v_max]
where v_min is the minimum allowable calibrated airspeed, v_max is the maximum allowable calibrated airspeed, and the action space is a distribution from 0 to 1000;
(3) setting a reward function:
the reward function of the heading agent is used to guide the agent to select a heading, and is expressed as:
R = α_d·d + α_h·Δhdg    (11)
where d is the distance from the current position of the aircraft to the target position, Δhdg is the current heading minus the target heading, and α_d and α_h are the coefficients for distance and heading respectively;
the reward function of the speed agent is used to change the speed of the aircraft so that it arrives at the target position at the correct time, and is expressed as:
Δd = α_d·(d − d')    (12)
where d' is the current speed of the aircraft multiplied by the delay time;
the advantage function is defined as the actual Q value minus the estimated Q value; the ratio function is defined as the difference between the two probability distributions after importance-sampling processing;
the loss function is defined such that the policy network updates the policy by maximizing the advantage function;
S33: training the nested reinforcement learning model in which heading and speed are selected separately:
(1) the heading agent is taken as the master agent with the speed agent nested inside it; the action of selecting the heading is taken by the master agent, and the action of controlling and changing the speed is realized by the nested speed agent; the state space of the master agent is [Δlat, Δlon, tarhdg, hdg], and the state of the nested agent is [Δd];
(2) the master agent and the nested speed agent of the nested reinforcement learning model have the same neural network structure, namely an actor-critic (AC) structure; for the critic (evaluation) network, the advantage function is defined as:
A(S_t, a; θ_c) = R_t + γ·V(S_{t+1}; θ_c) − V(S_t; θ_c)    (13)
where θ_c are the parameters of the critic network (Critic Network) matrix, R_t is the instant reward, V(S_{t+1}; θ_c) is the state-value function of the next state, V(S_t; θ_c) is the state-value function of the current state, and γ is a value between 0 and 1 representing the future discount factor: the further a reward lies in the future, the less it is taken into account;
using least squares, the update formula for parameter θ_c is:
θ_c ← θ_c + α·∇_{θ_c} A²(S_t, a; θ_c)    (14)
where α is the learning rate defined for the critic network, θ_c are the parameters of the critic-network matrix, and ∇_{θ_c} A²(S_t, a; θ_c) denotes the update step of parameter θ_c;
for the policy network (Policy Network), a policy-gradient method is adopted, with π(a|S_t, θ_p) denoting the probability of selecting action a in state S_t; the update formula for the policy-network parameters θ_p is:
θ_p ← θ_p + α·∇_{θ_p} log π(a|S_t; θ_p)·A(S_t, a; θ_c)    (15)
where α is the learning rate defined for the policy network, the same as the learning rate in equation (14), θ_p are the parameters of the policy-network matrix, and ∇_{θ_p} log π(a|S_t; θ_p)·A(S_t, a; θ_c) denotes the update step of parameter θ_p;
finally, the randomly sampled data are optimized with a failed-experience-replay method, which improves the convergence direction of the neural network and solves the sparse-reward problem.
Further, in S31, initializing the environment comprises: randomly generating several navigation points in the landing direction of the airport and randomly generating a delay-time sequence, so that the training aircraft lands through the navigation points in the correct time order; the aircraft is spawned at a random position in a designated area with a random heading, and its speed and altitude are set.
Further, step S3 is followed by:
S4: updating the four-dimensional trajectory model according to the simulation time of the simulation system, so that each four-dimensional trajectory point carries a time tag as its time identifier; when a new aircraft selects a four-dimensional trajectory point, judging from the point's time tag whether it is already occupied, and if so, giving up the current four-dimensional trajectory point and reselecting.
Further, in the route-and-type four-dimensional trajectory model established in S2, each four-dimensional trajectory point is either generated on a single route independently or distributed over different routes that share an intersecting leg, and four-dimensional trajectory points distributed on different routes become the same four-dimensional trajectory point on the intersecting leg.
The invention has the beneficial effects that:
(1) The present invention has advantages that differ from the air traffic control algorithms produced over the last two decades. By studying mainstream air traffic control algorithms and analysing their advantages and disadvantages under current traffic conditions, an air traffic control method based on reinforcement learning and four-dimensional trajectories was finally chosen. It combines the advantages of non-reinforcement-learning and reinforcement-learning-based air traffic control, innovates on that basis, and adopts a four-dimensional trajectory model to realize and simplify air traffic control under large-scale traffic. Experimental results in a flight simulation verification system show that the algorithm is suitable for air traffic control under heavy traffic and can be integrated into a related project engine or framework.
(2) In the database-construction stage, the invention generates the aircraft aerodynamic performance models by modeling the engines of different aircraft types. The simulation system that controls the aircraft motion is built on these aerodynamic performance models, so the aircraft moves and is simulated according to its real motion process, which effectively strengthens the reliability of the simulation results.
(3) In the data-processing stage, flight motion data of different aircraft types on different routes are collected to obtain the route-and-type four-dimensional trajectory model. The collected data include the aircraft's latitude and longitude, speed, altitude, heading angle, roll angle and pitch angle. Generating the four-dimensional trajectory model through data collection and data playback solves the problem that the aircraft state cannot be calculated because the speed is difficult to predict.
(4) In the core-algorithm design stage, a reinforcement learning algorithm is introduced with the four-dimensional trajectory as the target: by adjusting the heading, speed and altitude of the aircraft, the aircraft holds the required heading and follows the four-dimensional trajectory on time. In the training stage, the four-dimensional trajectory and the aircraft spawn attitude are set so that the aircraft appears within a suitable range, for example 20-50 km from a four-dimensional trajectory point, and the idea of imitation learning with expert data is used to guide and accelerate the convergence of the algorithm.
Drawings
Fig. 1 is a flowchart of an air control method based on reinforcement learning and four-dimensional trajectory according to an embodiment of the present invention.
FIG. 2 is a diagram of selecting the effective data range for straight-leg flight in the 4D-trajectory-based air traffic control method of the invention.
FIG. 3 is a diagram of selecting the effective data range for arc-leg flight in the 4D-trajectory-based air traffic control method of the invention.
FIG. 4 is a geometric schematic diagram of an LSSVM algorithm of the 4D track-based air traffic control method of the invention.
Fig. 5 is a schematic diagram of the distance between the end point and the target point in 200 experimental results provided by the example of the present invention.
Fig. 6 is a schematic diagram of the track angle to the end point in 200 experimental results provided by the example of the present invention.
FIG. 7 is a schematic diagram of the delay time of the end point in 200 experiments according to the embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. The invention draws on the idea of inverse reinforcement learning: a suitable evaluation (critic) network and policy network are pre-trained with expert data in advance, and the reward function is fitted according to the cumulative-return formula. Fitting the reward function with an inverse reinforcement learning algorithm avoids the incomplete design considerations and poor convergence that arise when the reward function is designed subjectively.
Regarding the sparse-reward problem in agent training: sparse rewards slow down the convergence of the algorithm and may even trap it in local optima. The present invention employs failed experience replay (HER) to avoid sparse-reward situations. Reward sparsity means that when an agent trains in a large space it rarely reaches the target, so its learning efficiency is low, ever more training is needed, and the training effect deteriorates. The failed-experience-replay method effectively solves the sparse-reward problem; its idea is to modify the target value of each piece of data so that the modified data become effective data that reach the target.
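The relabelling idea described above can be sketched as follows (an illustrative sketch only; the transition format and the reward_fn interface are assumptions): after a failed episode, the stored goal of each transition is replaced by a state that was actually reached, so that the relabelled data count as effective data reaching the target.

```python
# Sketch of the failed-experience-replay idea: relabel a failed episode with a goal
# that was actually achieved and recompute the rewards accordingly.
def relabel_failed_episode(episode, reward_fn):
    """episode: list of (state, action, reward, next_state, goal) tuples (assumed format)."""
    achieved_goal = episode[-1][3]                 # the final reached state becomes the new goal
    relabelled = []
    for state, action, _, next_state, _ in episode:
        new_reward = reward_fn(next_state, achieved_goal)   # recompute reward w.r.t. the new goal
        relabelled.append((state, action, new_reward, next_state, achieved_goal))
    return relabelled

# toy usage: 1-D states, reward 0 when the goal is reached, -1 otherwise
episode = [(0, +1, -1, 1, 5), (1, +1, -1, 2, 5), (2, +1, -1, 3, 5)]
print(relabel_failed_episode(episode, lambda s, g: 0 if s == g else -1))
```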
For the multi-policy-network output problem of multi-objective agents, the invention adopts a nested reinforcement learning method comprising two agents that control the heading and the speed of the aircraft respectively. The policy network of the master agent is a heading-control network whose output is the probability distribution of the aircraft's target heading; route selection is achieved by choosing the target heading. The policy network of the nested agent is a speed-control network whose output is the probability distribution of the aircraft's target speed; arrival-time control is achieved by choosing the target speed.
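One decision step of this nested structure can be sketched as follows; the DummyPolicy class and the dictionary-based aircraft state are stand-ins for the trained policy networks and the simulator state, not part of the invention's implementation:

```python
# Sketch of one nested decision step: the master (heading) policy selects the route,
# then the nested (speed) policy controls the arrival time.
import random

class DummyPolicy:
    def __init__(self, low, high):
        self.low, self.high = low, high
    def sample(self, state):
        return random.uniform(self.low, self.high)   # placeholder for the network output

def nested_decision_step(heading_policy, speed_policy, ac, target, time_remaining):
    # master (heading) agent observation: [Δlat, Δlon, tarhdg, hdg]
    hdg_state = [target["lat"] - ac["lat"], target["lon"] - ac["lon"],
                 target["hdg"], ac["hdg"]]
    target_heading = heading_policy.sample(hdg_state)        # selects the route
    # nested (speed) agent observation adds calibrated airspeed and remaining time
    spd_state = hdg_state + [ac["cas"], time_remaining]
    target_cas = speed_policy.sample(spd_state)               # controls the arrival time
    return target_heading, target_cas

ac = {"lat": 30.2, "lon": 104.1, "hdg": 10.0, "cas": 450.0}
target = {"lat": 30.6, "lon": 103.9, "hdg": 20.0}
print(nested_decision_step(DummyPolicy(0, 360), DummyPolicy(200, 600), ac, target, 120.0))
```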
The invention realizes intelligent control of air traffic with a four-dimensional trajectory model and a reinforcement learning algorithm. The flow of the method is shown in FIG. 1 and specifically comprises the following steps:
Step one: establishing aircraft aerodynamic performance models for different aircraft types by modeling the engine performance and other characteristics of each type.
The modeling method provided in the embodiment of the invention mainly targets the aerodynamic performance models of fixed-wing aircraft and helicopters, and mainly models performance characteristics such as speed, acceleration and climb rate.
Step two: according to the aircraft aerodynamic performance models, four-dimensional trajectory data of different aircraft types on different routes are collected, and the route four-dimensional trajectory model is then generated through data playback.
The speed models of different aircraft types provided in the embodiment differ, so different route four-dimensional trajectory models can be generated for different aircraft types on the same route. Six-degree-of-freedom information and time information of the aircraft are collected every 1 s, and the route four-dimensional trajectory model is generated through data playback.
Step three: based on a reinforcement learning algorithm, a neural network is built and trained so that the aircraft follows the four-dimensional trajectory; the trained neural network is then used to make the aircraft follow the four-dimensional trajectory at the specified time, speed, heading and altitude.
The reinforcement learning algorithm provided in the embodiment takes the four-dimensional trajectory to be followed as the target, builds a multilayer neural network with a stochastic policy and gradient descent, and constructs the agent. The stochastic policy is then updated through the agent's sampling and the gradient descent of the neural network.
Step four: for the problem that the four-dimensional trajectory points of several routes intersect and overlap, a time attribute is set for each four-dimensional trajectory point, four-dimensional trajectories with the same time tag merge into the same four-dimensional trajectory at the intersection, and the updating of the four-dimensional trajectory model and its conflict-avoidance algorithm are realized according to the running time of the four-dimensional trajectory system and the time tags of the four-dimensional trajectories.
The application principle of the present invention is further explained with reference to the following specific embodiments;
example 1: four-dimensional track air control implementation process and analysis
(I) Four-dimensional trajectory data acquisition
(1) Flight simulation system
The invention selects a flight simulation system equipped with an aircraft aerodynamic performance model for the experiments. The flight of the aircraft used for simulation training is subject to the constraints of the aerodynamic performance model, i.e., aircraft performance constraints such as engine performance and aircraft weight. An aircraft trained in such a flight simulation system matches the flight behaviour of a real aircraft more closely to a certain extent, and the training results are more suitable for application in a real flight environment.
(2) Course four-dimensional trajectory data acquisition
A flight route typically has several key position points that carry the aircraft's motion-state information such as heading, altitude and speed. From these key position points a route can be summarized. Because the specific routes differ between aircraft types, each aircraft type corresponds to its own route.
After the key position points are defined, an aircraft of one type is selected in the flight simulation system with the aircraft aerodynamic performance model to fly a simulated flight along the route summarized from the specified position points; information such as flight time, the six degrees of freedom of the aircraft and environmental factors is recorded at fixed time intervals and stored in a recording file.
(II) establishing a route
The qualified discrete track points are mapped onto the route to form the discrete track points on the route.
The collected set of track points meeting the conditions is:
G = {g_i, i = 1, 2, 3, ..., n}    (1)
the straight-line course data map is shown in FIG. 2.
The method comprises the following steps: each track point is perpendicular to the route l and intersects with the route, and the attribute of the track point is the attribute of the intersection on the route, so that all points in the rectangle are mapped onto the route to form a discrete track point set on the route.
The arc course data map is shown in FIG. 3.
The method comprises the following steps: each track point is connected with the center of a circle to form an intersection point of a straight line and an arc line, and the attribute of the track point is the attribute of the intersection point on the arc line, so that all points in the sector are mapped onto the arc line to form a discrete track point set on the arc line.
The set of discrete track points on the route obtained after data mapping is:
G' = {g'_i, i = 1, 2, 3, ..., n}    (2)
As shown in FIG. 2, the distance s_i from each discrete point g_i to the route origin E can be obtained from the distance formula. Similarly, as shown in FIG. 3, the distance s'_i from each discrete point g'_i to the arc origin B can be calculated. The set of discrete track points on the route in terms of distance and speed is:
W' = {(s_i, v_i), i = 1, 2, ..., n}    (3)
where s_i is the distance from the sampling point to the starting point of the route, v_i is a one-dimensional output vector denoting the speed of the aircraft at the position s_i, and n is the number of samples.
For the collected sample set W', the relation between s_i and v_i is nonlinear, so a simple linear fit cannot describe the aircraft's speed behaviour in the air well; to solve this problem, the LSSVM (Least Squares Support Vector Machine) method from machine learning is selected.
The LSSVM is an improvement of the SVM (support vector machine); introducing a least-squares loss function and equality constraints consumes fewer resources and makes the solution faster. Compared with the SVM, the empirical risk of the LSSVM is expressed as the sum of squares of the distances ξ_i from each sample point to the hyperplane, where ξ_i denotes the point-to-plane distance.
Training minimizes the empirical risk
min Σ_{i=1}^{n} ξ_i²
(i.e., the sum of the squared distances from each sampling point to the hyperplane is minimal), and its mathematical model is:
min_{w,b,ξ} (1/2) Σ_{i=1}^{n} ξ_i²   s.t.   v_i = w·s_i + b + ξ_i,  i = 1, ..., n    (4)
According to the structural-risk-minimization principle, the LSSVM must also ensure maximization of the margin between the two classification hyperplanes, so the mathematical model actually solved is a compromise between empirical risk and structural risk:
min_{w,b,ξ} (1/2)‖w‖² + (C/2) Σ_{i=1}^{n} ξ_i²   s.t.   v_i = w·φ(s_i) + b + ξ_i,  i = 1, ..., n    (5)
where C is a penalty factor, ξ_i is the training error, and φ(·) denotes the feature mapping associated with the kernel function below. To solve this optimization problem, the Lagrange function is introduced:
L(w, b, ξ, α) = (1/2)‖w‖² + (C/2) Σ_{i=1}^{n} ξ_i² − Σ_{i=1}^{n} α_i [w·φ(s_i) + b + ξ_i − v_i]    (6)
where α_i, i = 1, ..., n are the Lagrange multipliers; under the KKT conditions the following relations are obtained:
∂L/∂w = 0 ⇒ w = Σ_{i=1}^{n} α_i·φ(s_i);  ∂L/∂b = 0 ⇒ Σ_{i=1}^{n} α_i = 0;  ∂L/∂ξ_i = 0 ⇒ α_i = C·ξ_i;  ∂L/∂α_i = 0 ⇒ w·φ(s_i) + b + ξ_i − v_i = 0    (7)
with the kernel function
K_ij = K(s_i, s_j) = φ(s_i)·φ(s_j)
The solution of equation (7) can then be written in the form:
[ 0    e^T        ] [ b ]   [ 0 ]
[ e    Q + C⁻¹·I  ] [ α ] = [ v ]    (8)
where Q is the n × n kernel matrix with elements K_ij, I is the identity matrix, e = [1, ..., 1]^T, α = [α_1, ..., α_n]^T and v = [v_1, ..., v_n]^T. Solving equation (8) yields the values of α_i and b, and substituting them back gives the chaotic time-series regression model of the LSSVM:
f(s) = Σ_{i=1}^{n} α_i·K(s, s_i) + b    (9)
The speed value at each position point s on the corresponding route is then:
v(s) = Σ_{i=1}^{n} α_i·K(s, s_i) + b    (10)
After the s-v mapping of the route is obtained, the route model is summarized.
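As an illustration of equations (8) to (10), the following sketch fits the s-v samples with an LSSVM-style solver in numpy; the RBF kernel, the regularization values and the toy data are assumptions made only for this example and are not fixed by the description above:

```python
# Sketch of the LSSVM regression of speed v against leg distance s (equations (8)-(10)).
import numpy as np

def rbf_kernel(s1, s2, sigma=2000.0):
    # assumed RBF kernel; the description does not fix the kernel choice
    return np.exp(-((s1[:, None] - s2[None, :]) ** 2) / (2 * sigma ** 2))

def lssvm_fit(s, v, C=10.0, sigma=2000.0):
    n = len(s)
    Q = rbf_kernel(s, s, sigma)                  # kernel matrix K_ij
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                               # e^T
    A[1:, 0] = 1.0                               # e
    A[1:, 1:] = Q + np.eye(n) / C                # Q + C^-1 I
    rhs = np.concatenate(([0.0], v))
    sol = np.linalg.solve(A, rhs)                # solve the linear system of equation (8)
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(s_query, s_train, alpha, b, sigma=2000.0):
    return rbf_kernel(np.atleast_1d(s_query), s_train, sigma) @ alpha + b   # equation (10)

# toy data: distance along the leg (m) vs. recorded speed (km/h), assumed for illustration
s_train = np.linspace(0, 20000, 50)
v_train = 500 - 0.01 * s_train + 5 * np.sin(s_train / 3000)
alpha, b = lssvm_fit(s_train, v_train)
print(lssvm_predict(12000.0, s_train, alpha, b))
```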
(III) reinforcement learning algorithm
After the four-dimensional trajectory of the route has been calculated, how to allocate and schedule aircraft to join the route becomes the more critical problem. Applying a reinforcement learning algorithm solves this problem, which is hard to decide directly. Using reinforcement learning for aircraft scheduling can be divided into two stages:
(1) experimental training phase
The core of the reinforcement learning algorithm is the neural network; the finally trained neural network can be used directly in the flight simulation environment for aircraft deployment.
The reinforcement learning experiment can be divided into two main parts: the environment and the algorithm.
The environment defines the aircraft's state, motion model, aerodynamic constraints, reward function, training target, and so on. The aircraft state is a list comprising the aircraft's latitude and longitude, heading, altitude, speed, the state of the target point, etc. The aircraft outputs an action value from the currently trained neural network and the agent's input state list; after the constraints of the aerodynamic performance model are applied, the action value becomes actions the aircraft can execute, such as heading and speed, from which the next state is computed, completing one step of an episode.
The most important part of the reinforcement learning algorithm is the agent's learning process, i.e., the updating of the neural network. In the main loop of the experiment, the current reward is calculated from the current aircraft state at each step, and the neural network is updated in the direction of increasing reward; this updating of the neural network is the aircraft's learning process.
The method comprises the following specific steps:
1) Setting up the experimental environment in the simulation system
The experimental environment is based mainly on the BlueSky simulation environment. BlueSky is an open air traffic control simulator using the OpenAP aircraft performance model.
The BlueSky simulator provides a plug-in function module, a simple and extensible tool for communicating and interacting with the server that helps to call the functions of the flight-control module. The reinforcement learning experiment environment is therefore built in the plug-in module.
The specific experimental environment settings are as follows:
We choose F16 as the training aircraft type; the aircraft spawn location lies between latitude 30 and 38 and longitude 103 and 106; to handle the large amount of sampled data, the maximum simulation speed is used during training.
The experiment simulates the landing process of an aircraft at the ZUUU airport. At the start of training, the environment is initialized as follows:
The system randomly generates 3 to 5 navigation points at reasonable positions in the ZUUU landing direction and, at the same time, randomly generates a reasonable delay-time sequence so that the training aircraft lands through the navigation points in the correct time order. The delay-time sequence, for example 65, 180, 252, ... (in seconds), is random but reasonable; it must be neither too long nor too short, otherwise the required aircraft speed becomes unreasonably low or high. The aircraft spawns at a random position in the airspace south of ZUUU with a random heading, a speed of 500 km/h and an altitude of 500 m. The AI controller is not involved in adjusting the aircraft altitude; the basic "ALT" command of BlueSky is used to change the altitude and enable the landing simulation.
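The initialization described above can be sketched as follows; the helper names and any numeric ranges not fixed in the text (waypoint offsets, delay increments) are assumptions of this sketch:

```python
# Illustrative environment-reset sketch: random waypoints in the landing direction,
# a random delay sequence, and a randomly spawned training aircraft.
import random

def reset_environment():
    n_wpt = random.randint(3, 5)                       # 3-5 navigation points
    waypoints = [
        # latitude/longitude offsets south of the airport are assumed values
        (30.4 + 0.05 * i + random.uniform(-0.01, 0.01),
         104.0 + random.uniform(-0.02, 0.02))
        for i in range(n_wpt)
    ]
    delays, t = [], 0.0
    for _ in range(n_wpt):
        t += random.uniform(60.0, 120.0)               # assumed increments, keeps speeds reasonable
        delays.append(round(t))                        # e.g. 65, 180, 252, ... seconds
    aircraft = {
        "type": "F16",
        "lat": random.uniform(30.0, 38.0),             # spawn area given in the text
        "lon": random.uniform(103.0, 106.0),
        "hdg": random.uniform(0.0, 360.0),             # random heading
        "spd_kmh": 500.0,                              # initial speed from the text
        "alt_m": 500.0,                                # initial altitude from the text
    }
    return waypoints, delays, aircraft

print(reset_environment())
```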
2) Building reinforced learning algorithm
The motion of the aircraft is continuous motion in a sparse space, so a stochastic policy is generally chosen for updating the neural network; this effectively avoids the parameter-tuning burden and the tendency to fall into local optima that a deterministic policy brings. The algorithm uses the PPO (Proximal Policy Optimization) algorithm as its prototype and comprises a policy network and an evaluation network (a state-value-function network).
State space:
In the reinforcement learning environment every possible state can affect the result of the experiment, so when designing the state space all parameters that may influence the results must be considered. Our experimental goal is to reach the target location (latitude, longitude, heading) within a certain time. There are two agents in the experiment: the speed agent selects the speed and the heading agent changes the heading, so a state space has to be designed for each of the two models.
For the heading agent, the goal of the model is to reach the target location (latitude, longitude, altitude) with the target heading, so we design a state space comprising [Δlat, Δlon, tarhdg, hdg]. Δlat is the difference between the target latitude and the aircraft latitude, and Δlon is the difference between the target longitude and the aircraft longitude. tarhdg denotes the target heading and hdg the aircraft heading. cas denotes the calibrated airspeed of the aircraft.
For the speed agent, the goal of the model is to reach the target location (latitude, longitude, altitude, heading) at a given time, so the state space considered first is [Δlat, Δlon, tarhdg, hdg, cas, time]. However, we finally found that the state space of the speed agent may depend only on a distance increment: one distance is the distance from the aircraft to the target location, and the other is the delay time multiplied by cas. Compared with the heading agent, this model has the additional state parameter time, which denotes the time remaining to reach the target.
Action space:
In the nested reinforcement learning algorithm there are two models that output actions, namely heading and speed. The action space of each model is defined as follows:
Action space of the heading agent: A_t = [0, hdg, 360]
where the minimum heading is 0 degrees, the maximum heading is 360 degrees, and the action space is a distribution from 0 to 360.
Action space of the speed agent: A_t = [v_min, v_(t-1), v_max]
where v_min is the minimum allowable calibrated airspeed, v_max is the maximum allowable calibrated airspeed, and the action space is a distribution from 0 to 1000.
The reward function:
The aim of the experiment is to reach a defined state (latitude and longitude, heading, speed) within a defined time, inside a fairly large time frame. Two agents calculate rewards separately. The input state of the first agent is [Δlat, Δlon, tarhdg, hdg] and its output is the heading, which is used to select the route. The input state of the nested agent is [Δlat, Δlon, tarhdg, hdg, cas, time] and its output is cas, which is used to control the arrival time and speed. There are therefore two different reward functions in the experiment.
Theoretically, a reinforcement learning reward can be summarized as ||current state − target state||, i.e., a norm or abstract distance between the current state and the target. This is an abstract notion, and the reward function should be designed according to the specific situation.
The reward function of the first agent guides the agent to select a heading; in other words, it selects the route. Based on the input state, we propose the following reward function:
R = α_d·d + α_h·Δhdg    (11)
where d is the distance (m) from the current position of the aircraft to the target position, Δhdg equals the current heading of the aircraft minus the target heading, and α_d and α_h are the coefficients for distance and heading.
The goal of the second agent is to change the speed of the aircraft so that it arrives at the target location at the correct time:
Δd = α_d·(d − d')    (12)
where d' is the current speed of the aircraft multiplied by the delay time.
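The two reward terms of equations (11) and (12) can be sketched as follows; the coefficient values and the wrapping of the heading error are assumptions of this sketch, since α_d and α_h are left unspecified above:

```python
# Sketch of the two reward terms, equations (11) and (12).
ALPHA_D, ALPHA_H = -0.001, -0.01   # assumed values (negative: smaller error, larger reward)

def heading_reward(dist_to_target_m, hdg_deg, target_hdg_deg):
    # wrapping the heading error to [-180, 180) and taking its magnitude is an assumption
    delta_hdg = (hdg_deg - target_hdg_deg + 180.0) % 360.0 - 180.0
    return ALPHA_D * dist_to_target_m + ALPHA_H * abs(delta_hdg)     # equation (11)

def speed_reward(dist_to_target_m, speed_mps, delay_time_s):
    d_prime = speed_mps * delay_time_s            # d': current speed times the delay time
    return ALPHA_D * (dist_to_target_m - d_prime)                    # equation (12)

print(heading_reward(12000.0, 95.0, 90.0), speed_reward(12000.0, 130.0, 90.0))
```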
3) Carry out training
Reaching a target location (latitude, longitude, heading) at the correct time is a difficult task. Traditional reinforcement learning methods and algorithms use a single network structure and can hardly solve the air traffic control task completely. A single model might cope with a single task, for example reaching a certain state at constant speed without a time requirement, or reaching a certain position at variable speed without a heading requirement. We therefore propose nested reinforcement learning to handle the air traffic control task.
Based on the PPO algorithm and drawing on the nested reinforcement learning method proposed by Marc in 2018, we design a nested reinforcement learning model in which heading and speed are selected separately. With the nested model we can train agents that provide more than one action. We use one master agent (selecting the heading) with a second agent (selecting the speed) nested inside it. The master agent takes an action (selects the heading), and the nested model then controls the action of changing the speed. One important difference between the master agent and the nested agent is the state space: the state space of the master agent is [Δlat, Δlon, tarhdg, hdg], while the state of the nested agent is [Δd]. The design details and training process of the two models are described separately below.
The master agent and the nested speed agent of the nested reinforcement learning model have the same neural network structure, namely an actor-critic (AC) structure. For the critic (evaluation) network, the advantage function is defined as:
A(S_t, a; θ_c) = R_t + γ·V(S_{t+1}; θ_c) − V(S_t; θ_c)    (13)
where θ_c are the parameters of the critic network (Critic Network) matrix, R_t is the instant reward, V(S_{t+1}; θ_c) is the state-value function of the next state, V(S_t; θ_c) is the state-value function of the current state, and γ is a value between 0 and 1 representing the future discount factor: the further a reward lies in the future, the less it is taken into account.
Using least squares, the update formula for parameter θ_c is:
θ_c ← θ_c + α·∇_{θ_c} A²(S_t, a; θ_c)    (14)
where α is the learning rate defined for the critic network, θ_c are the parameters of the critic-network matrix, and ∇_{θ_c} A²(S_t, a; θ_c) denotes the update step of parameter θ_c.
For the policy network (Policy Network), a policy-gradient method is adopted, with π(a|S_t, θ_p) denoting the probability of selecting action a in state S_t. The update formula for the policy-network parameters θ_p is:
θ_p ← θ_p + α·∇_{θ_p} log π(a|S_t; θ_p)·A(S_t, a; θ_c)    (15)
where α is the learning rate defined for the policy network, the same as the learning rate in equation (14), θ_p are the parameters of the policy-network matrix, and ∇_{θ_p} log π(a|S_t; θ_p)·A(S_t, a; θ_c) denotes the update step of parameter θ_p.
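A minimal sketch of equations (13) to (15) as one plain actor-critic gradient step is given below; the network sizes, the optimizer, the discretization of the heading action and the omission of the PPO clipping and ratio terms are all assumptions of this sketch, not the trained networks of the embodiment:

```python
# Minimal PyTorch sketch: advantage from the critic (13), a least-squares critic
# update (14), and a policy-gradient actor update (15).
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
actor  = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 36))  # 36 heading bins (assumed)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(actor.parameters(),  lr=1e-4)
gamma = 0.99

def update(state, action, reward, next_state):
    s, s2 = torch.as_tensor(state), torch.as_tensor(next_state)
    # equation (13): A = R_t + gamma * V(S_{t+1}) - V(S_t)
    advantage = reward + gamma * critic(s2).detach() - critic(s)
    # equation (14): least-squares critic update (minimise A^2)
    opt_c.zero_grad(); (advantage ** 2).mean().backward(); opt_c.step()
    # equation (15): policy-gradient actor update (maximise log pi(a|s) * A)
    logp = torch.log_softmax(actor(s), dim=-1)[action]
    opt_a.zero_grad(); (-(logp * advantage.detach())).mean().backward(); opt_a.step()

update([0.1, -0.2, 90.0, 85.0], 9, -1.0, [0.08, -0.15, 90.0, 86.0])
```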
Finally, the randomly sampled data are optimized with the failed-experience-replay method, which improves the convergence direction of the neural network and solves the sparse-reward problem.
How the agent explores the environment is crucial to the experiment: the sampling quality during training directly affects the convergence efficiency and the convergence result of the algorithm. The best outcome one can hope for is that every sample is good data that reaches the target. The randomly sampled data are therefore optimized with the HER method, giving the neural network a better convergence direction and solving the sparse-reward problem. The nested agent has the same network structure as the master agent and differs from it only in its hyper-parameters.
Simulation with the simulation system:
The PPO algorithm can reuse sampled data in an off-policy manner, and the state of the neural network can be stored and loaded in real time.
In the flight simulation system, the trained neural network is called to dispatch the aircraft; target state information is transmitted in real time, and the aircraft automatically joins the four-dimensional trajectory.
(IV) Updating of the four-dimensional trajectory model and the conflict-avoidance algorithm of the four-dimensional trajectory model
The four-dimensional trajectory model is updated according to the simulation time of the simulation system, and each four-dimensional trajectory point carries a time tag as its identifier. In the simulation system, for each four-dimensional trajectory, the point with the earliest time tag disappears first at the end of the trajectory, and a new four-dimensional trajectory point is generated at the start of the trajectory after a certain simulation time.
A conflict in the four-dimensional trajectory model means that, for a four-dimensional trajectory point that is occupied or about to be occupied by one aircraft, a new aircraft chooses to occupy the same point. The conflict-avoidance algorithm works as follows: when a new aircraft chooses to occupy a four-dimensional trajectory point, the time tag of that point is checked to see whether it is already occupied; if it is, the current four-dimensional trajectory point is given up and another one is selected.
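The occupancy check can be sketched as follows; the data structures and method names are illustrative only:

```python
# Sketch of the conflict-avoidance rule: a 4D trajectory point is keyed by its time
# tag, and a new aircraft may only claim it if that time tag is still free.
class FourDTrajectory:
    def __init__(self, points):
        # points: list of dicts like {"lat":..., "lon":..., "alt":..., "time_tag":...}
        self.points = points
        self.occupied = {}                      # time_tag -> aircraft id

    def try_claim(self, point_index, aircraft_id):
        tag = self.points[point_index]["time_tag"]
        if tag in self.occupied:                # already taken: give up and reselect
            return False
        self.occupied[tag] = aircraft_id
        return True

traj = FourDTrajectory([{"lat": 30.5, "lon": 104.0, "alt": 500.0, "time_tag": 120}])
print(traj.try_claim(0, "AC001"), traj.try_claim(0, "AC002"))   # True False
```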
(V) results
To verify the effectiveness and applicability of the algorithm, a set of evaluation criteria for the algorithm was designed. The experiments analyse and compare the performance of the algorithm from the three angles of distance, angle and time (see FIGS. 5 to 7 for the results of 200 experiments), and the stability and accuracy of the algorithm are further verified against the established evaluation criteria and the corresponding index requirements.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. An air control method based on reinforcement learning and four-dimensional trajectories, characterized by comprising the following steps:
S1: establishing aircraft aerodynamic performance models for different aircraft types by modeling the engine performance of each type;
S2: collecting four-dimensional trajectory data of different aircraft types on different routes according to the aircraft aerodynamic performance models, and generating a route-and-type four-dimensional trajectory model through data playback;
S3: building a neural network based on a reinforcement learning algorithm and training the aircraft to follow the four-dimensional trajectory; constructing a nested reinforcement learning model in which a speed agent is nested inside a heading agent, so that route selection is achieved by choosing the aircraft's target heading and arrival-time control is achieved by choosing the aircraft's target speed, whereby the aircraft follows the four-dimensional trajectory model at the specified time, speed, heading and altitude.
2. The air control method based on reinforcement learning and four-dimensional trajectories according to claim 1, wherein the specific process of S1 is as follows: defining key position points carrying aircraft motion-state information; selecting an aircraft of a specific type in a flight simulation system equipped with the aircraft aerodynamic performance model to fly a simulated flight along the route summarized from the specified position points; recording, at fixed time intervals, information including flight time, the six degrees of freedom of the aircraft and environmental factors; and storing the information in a recording file.
3. The air control method based on reinforcement learning and four-dimensional trajectory according to claim 1, wherein the specific process of S2 is as follows:
S21: collecting the track points that meet the conditions to form a track point set G, and mapping each track point onto the route to obtain the set G' of discrete track-point mapping points on the route;
G = {g_i, i = 1, 2, 3, ..., n}    (1)
G' = {g'_i, i = 1, 2, 3, ..., n}    (2)
where g_i is a track point meeting the conditions, g'_i is the mapping point of track point g_i on the route, and n is the number of samples;
S22: calculating the distance s_i from each mapping point g'_i to the start of its leg, obtaining the sample set W' of discrete track-point mapping points on the route in terms of distance and speed;
W' = {(s_i, v_i), i = 1, 2, ..., n}    (3)
where s_i is the distance from the sampling point to the start of the route, and v_i is a one-dimensional output vector denoting the speed of the aircraft at the position a distance s_i from the start of the route;
s23: for the collected sample set W', an LSSVM in machine learning is selected, and the distance xi between each sample point and each hyperplane is usediRepresents the empirical risk of LSSVM, and the least empirical risk of training is
Figure FDA0002922997910000011
Minimum, its mathematical model is:
Figure FDA0002922997910000021
wherein w is viAbout siA linear parameter of (d); b is a linear offset;
according to the principle of minimizing the structural risk, the LSSVM needs to ensure the distance maximization of two classification hyperplanes, and the solved mathematical model is a compromise between empirical risk and structural risk, namely
Figure FDA0002922997910000022
Where C is a penalty factor and the distance ξ from a sample point to its hyperplaneiIs a training error;
s33: to solve this optimization problem, Lagrange's function is introduced:
Figure FDA0002922997910000023
wherein alpha isiN is Lagrange multiplier, e is unit vector;
Figure FDA0002922997910000024
representation wsiw/|w|;
The following relationship is obtained from the KKT condition:
Figure FDA0002922997910000025
kernel function
Figure FDA0002922997910000026
sjIs a navigation point mapping point g'jDistance to the starting point of each leg; then the solution form of equation (7) is converted into:
Figure FDA0002922997910000027
wherein Q is an element KijK × k order kernel matrix of (1), I is the identity matrix, and vector e ═ 1, …,1]TThe vector α ═ α1,…,αn]TVector v ═ v1,…,vn]T
Solving formula (8) to obtain alphaiAnd substituting the value of b into the formula (6) to obtain the chaotic time series regression model of the LSSVM, wherein the chaotic time series regression model of the LSSVM is as follows:
Figure FDA0002922997910000028
the speed value of each position point s on the corresponding route is as follows:
Figure FDA0002922997910000031
After the s-v mapping of the route is obtained, the route-and-type four-dimensional trajectory model is derived.
4. The air control method based on reinforcement learning and four-dimensional trajectory according to claim 3, wherein mapping each track point onto the route in S21 comprises:
straight-leg data mapping: drawing a perpendicular from each track point to the straight leg l; the intersection of this perpendicular with the leg is the mapping point corresponding to that track point;
arc-leg data mapping: connecting each track point with the center of the circle of the arc leg; the intersection of the resulting line with the arc is the mapping point corresponding to that track point.
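As an illustration only, a minimal geometric sketch of the two mapping rules of claim 4, written for a local flat x/y frame (an assumption; real code would first project latitude/longitude into such a frame):

    import numpy as np

    def map_to_straight_leg(p, a, b):
        """Foot of the perpendicular from track point p onto the straight leg a -> b."""
        p, a, b = np.asarray(p, float), np.asarray(a, float), np.asarray(b, float)
        ab = b - a
        t = np.dot(p - a, ab) / np.dot(ab, ab)
        t = np.clip(t, 0.0, 1.0)        # keep the mapped point on the leg itself
        return a + t * ab

    def map_to_arc_leg(p, center, radius):
        """Intersection of the line (center -> track point p) with the arc of given radius."""
        p, center = np.asarray(p, float), np.asarray(center, float)
        direction = p - center
        return center + radius * direction / np.linalg.norm(direction)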
5. The air control method based on reinforcement learning and four-dimensional trajectory according to claim 1, wherein S3 specifically comprises:
S31: setting up the experimental environment in the simulation system, determining the type of the training aircraft, the spawn position of the aircraft and the simulation speed, and initializing the environment;
S32: building the reinforcement learning algorithm based on the PPO algorithm:
(1) setting the state spaces:
there are two agents in the reinforcement learning experiment: a speed agent that selects the speed and a heading agent that changes the heading;
the state space of the heading agent is set as: [Δlat, Δlon, tarhdg, hdg];
where Δlat is the difference between the target latitude and the aircraft latitude, Δlon is the difference between the target longitude and the aircraft longitude, tarhdg denotes the target heading, and hdg denotes the current heading of the aircraft;
the state space of the speed agent is set as: [Δlat, Δlon, tarhdg, hdg, cas, time];
where cas denotes the calibrated airspeed of the aircraft and time denotes the remaining time to the target;
(2) setting the action spaces:
the action space of the heading agent is defined as: A_t = [0, hdg, 360]
where the minimum heading is 0 degrees, the maximum heading is 360 degrees, and the action space is a distribution over 0 to 360 degrees;
the action space of the speed agent is defined as: A_t = [v_min, v_{t-1}, v_max]
where v_min is the minimum allowable calibrated speed, v_max is the maximum allowable calibrated speed, and the action space is a distribution over 0 to 1000;
(3) setting the reward functions:
the reward function of the heading agent is used to guide the agent in selecting the heading and is expressed as:

R = α_d · d + α_h · Δhdg   (11)

where d is the distance from the current position of the aircraft to the target position, Δhdg is the current heading minus the target heading, and α_d and α_h are the coefficients of the distance and heading terms respectively;
the reward function of the speed agent is used to change the speed of the aircraft so that it arrives at the target position at the correct time, and is expressed as:

Δd = α_d · (d − d′)   (12)

where d′ is the current speed of the aircraft multiplied by the delay time;
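As an illustration only, a minimal Python sketch of the two reward functions (11) and (12); the coefficient values are assumptions, and wrapping the heading error to [0, 180] degrees is an implementation choice not stated in the claim:

    ALPHA_D = -0.001   # distance coefficient (assumed; negative so closer is better)
    ALPHA_H = -0.01    # heading coefficient (assumed)

    def heading_reward(dist_to_target_m, current_hdg_deg, target_hdg_deg):
        """R = alpha_d * d + alpha_h * dhdg, eq. (11); dhdg wrapped to [0, 180]."""
        dhdg = abs((current_hdg_deg - target_hdg_deg + 180.0) % 360.0 - 180.0)
        return ALPHA_D * dist_to_target_m + ALPHA_H * dhdg

    def speed_reward(dist_to_target_m, current_speed_mps, delay_time_s):
        """delta_d = alpha_d * (d - d'), eq. (12), with d' = current speed * delay time."""
        d_prime = current_speed_mps * delay_time_s
        return ALPHA_D * (dist_to_target_m - d_prime)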
S33: training the nested reinforcement learning model in which heading and speed are selected separately:
(1) the heading agent serves as the main agent, with the speed agent nested inside it; the action of selecting the heading is taken by the main agent, while the action of controlling and changing the speed is realized by the nested speed agent; the state space of the main agent is [Δlat, Δlon, tarhdg, hdg], and the state of the nested agent is [Δd];
(2) the main agent and the nested speed agent of the nested reinforcement learning model share the same neural network structure, namely the actor-critic (AC) structure; for the evaluation (critic) network, the advantage function is defined as:

A_t = R_t + γ·V(S_{t+1}; θ_c) − V(S_t; θ_c)   (13)

where θ_c are the parameters of the evaluation network's neural network matrix, R_t is the immediate reward, V(S_{t+1}; θ_c) is the state value function of the next state, V(S_t; θ_c) is the state value function of the current state, and γ is a value between 0 and 1 representing the future discount factor: the further a reward lies from the present, the less it is taken into account;
using the least squares criterion (gradient descent on the squared advantage), the update formula of parameter θ_c is:

θ_c ← θ_c − α·∇_{θ_c} A_t²   (14)

where α is the learning rate defined for the evaluation network, θ_c are the parameters of the evaluation network's neural network matrix, and ∇_{θ_c} A_t² gives the step size of the parameter update;
for the policy network, the policy gradient method is adopted, where π(a|S_t; θ_p) denotes the probability of selecting action a in state S_t; the update formula of the policy network parameters θ_p is:

θ_p ← θ_p + α·∇_{θ_p} log π(a|S_t; θ_p) · A_t   (15)

where α is the learning rate defined for the policy network, equal to the learning rate in equation (14), θ_p are the parameters of the policy network's neural network matrix, and ∇_{θ_p} log π(a|S_t; θ_p) · A_t gives the step size of the parameter update;
and finally, a failure experience replay method is adopted to optimize the randomly sampled data, thereby improving the convergence direction of the neural network and alleviating the sparse reward problem.
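As an illustration only, a minimal PyTorch sketch of one actor-critic update step implementing equations (13)-(15); it uses a plain advantage actor-critic step rather than the full PPO clipped objective, shares one network body between the policy and value heads, and all sizes and hyper-parameters are assumptions:

    import torch
    import torch.nn as nn

    class ACNet(nn.Module):
        """AC structure: a policy head over discrete actions and a value head."""
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
            self.policy = nn.Linear(hidden, n_actions)   # pi(a | S_t; theta_p)
            self.value = nn.Linear(hidden, 1)            # V(S_t; theta_c)

        def forward(self, obs):
            h = self.body(obs)
            return torch.distributions.Categorical(logits=self.policy(h)), self.value(h)

    def ac_update(net, opt, s_t, a_t, r_t, s_next, gamma=0.99):
        dist, v_t = net(s_t)
        with torch.no_grad():
            _, v_next = net(s_next)
        advantage = r_t + gamma * v_next - v_t                           # eq. (13)
        critic_loss = advantage.pow(2).mean()                            # least squares, eq. (14)
        actor_loss = -(dist.log_prob(a_t) * advantage.detach()).mean()   # policy gradient, eq. (15)
        opt.zero_grad()
        (critic_loss + actor_loss).backward()
        opt.step()

    # Example: heading agent with a 4-dim state and a discretised heading action set (assumed)
    net = ACNet(obs_dim=4, n_actions=36)
    opt = torch.optim.Adam(net.parameters(), lr=3e-4)
    s = torch.randn(1, 4); a = torch.tensor([7]); r = torch.tensor([[0.1]]); s2 = torch.randn(1, 4)
    ac_update(net, opt, s, a, r, s2)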
6. The air control method based on reinforcement learning and four-dimensional trajectory according to claim 5, wherein in step S31, initializing the environment comprises: randomly generating several waypoints in the landing direction of the airport and randomly generating a delay time sequence, so that the training aircraft passes the waypoints and lands in the correct time order; the aircraft spawns at a random position within a designated area with a random heading, and its speed and altitude are set to initial values.
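As an illustration only, a minimal Python sketch of the environment initialization described in claim 6; the coordinate ranges, waypoint count and initial speed/altitude values are assumptions:

    import random

    def reset_environment(n_waypoints=3):
        # Waypoints along the airport landing direction, plus a random delay
        # (arrival-time) sequence that fixes the order the aircraft must fly them in.
        waypoints = [{"lat": 30.50 + 0.02 * i + random.uniform(-0.005, 0.005),
                      "lon": 104.00 + 0.02 * i + random.uniform(-0.005, 0.005)}
                     for i in range(n_waypoints)]
        delays = sorted(random.uniform(60, 600) for _ in range(n_waypoints))

        # The aircraft spawns at a random position in a designated area with a
        # random heading; calibrated speed and altitude are set to fixed values.
        aircraft = {"lat": random.uniform(30.6, 30.8),
                    "lon": random.uniform(104.1, 104.3),
                    "hdg": random.uniform(0.0, 360.0),
                    "cas": 180.0,      # knots, assumed
                    "alt": 3000.0}     # metres, assumed
        return waypoints, delays, aircraft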
7. The air control method based on reinforcement learning and four-dimensional trajectory according to claim 1, further comprising, after S3:
S4: updating the four-dimensional track model according to the simulation time of the simulation system, so that each four-dimensional track point carries a time tag as its time identifier; when a new aircraft selects a four-dimensional track point according to its time tag, judging whether the time tag of that track point is already occupied, and if so, abandoning the current four-dimensional track point and reselecting.
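As an illustration only, a minimal Python sketch of the time-tag occupancy check of S4; the track-point identifiers and the reservation table are assumptions:

    class TrackPointRegistry:
        def __init__(self):
            self.occupied = {}   # (trackpoint_id, time_tag) -> aircraft_id

        def try_occupy(self, trackpoint_id, time_tag, aircraft_id):
            """Reserve a 4D track point for its time tag; refuse if already taken."""
            key = (trackpoint_id, time_tag)
            if key in self.occupied:
                return False          # occupied: the aircraft must reselect
            self.occupied[key] = aircraft_id
            return True

    registry = TrackPointRegistry()
    assert registry.try_occupy("WP07", 1622541600, "CCA1234")
    assert not registry.try_occupy("WP07", 1622541600, "CSN5678")  # conflict -> reselect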
8. The air control method based on reinforcement learning and four-dimensional track according to claim 1, wherein in the route-aircraft-type four-dimensional track model established in S2, each four-dimensional track point is generated on a single route or distributed across different routes that intersect, and four-dimensional track points distributed on different routes coincide into the same four-dimensional track point at the intersection of the routes.
CN202110134760.1A 2021-01-29 2021-01-29 Air control method based on reinforcement learning and four-dimensional track Expired - Fee Related CN112818599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110134760.1A CN112818599B (en) 2021-01-29 2021-01-29 Air control method based on reinforcement learning and four-dimensional track

Publications (2)

Publication Number Publication Date
CN112818599A true CN112818599A (en) 2021-05-18
CN112818599B CN112818599B (en) 2022-06-14

Family

ID=75860960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110134760.1A Expired - Fee Related CN112818599B (en) 2021-01-29 2021-01-29 Air control method based on reinforcement learning and four-dimensional track

Country Status (1)

Country Link
CN (1) CN112818599B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113393495A (en) * 2021-06-21 2021-09-14 暨南大学 High-altitude parabolic track identification method based on reinforcement learning
CN114115304A (en) * 2021-10-26 2022-03-01 南京航空航天大学 Aircraft four-dimensional climbing track planning method and system
CN114141062A (en) * 2021-11-30 2022-03-04 中国电子科技集团公司第二十八研究所 Aircraft interval management decision method based on deep reinforcement learning
CN115524964A (en) * 2022-08-12 2022-12-27 中山大学 Rocket landing real-time robust guidance method and system based on reinforcement learning
CN115691231A (en) * 2023-01-03 2023-02-03 中国电子科技集团公司第二十八研究所 Method and system for simulation deduction and conflict resolution by using air plan

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2430278A (en) * 2004-04-29 2007-03-21 Blaga N Iordanova Global neural network for conflict resolution of flights
CN101692315A (en) * 2009-09-25 2010-04-07 民航总局空管局技术中心 Method for analyzing high precision 4D flight trajectory of airplane based on real-time radar data
CN106340209A (en) * 2015-01-07 2017-01-18 江苏理工学院 Control method of air traffic control system for 4D trajectory-based operation
US10037704B1 (en) * 2017-02-01 2018-07-31 David Myr Automatic real-time air traffic control system and method for maximizing landings / takeoffs capacity of the airport and minimizing aircrafts landing times
CN109542876A (en) * 2018-11-20 2019-03-29 南京莱斯信息技术股份有限公司 Extracting method based on Hadoop data mining aircraft experience locus model key factor
CN110930770A (en) * 2019-11-06 2020-03-27 南京莱斯信息技术股份有限公司 Four-dimensional track prediction method based on control intention and airplane performance model
CN110806759A (en) * 2019-11-12 2020-02-18 清华大学 Aircraft route tracking method based on deep reinforcement learning

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
LAN MA et al.: "A Hybrid CNN-LSTM Model for Aircraft 4D Trajectory Prediction", IEEE Access *
PAVEEN JUNTAMA et al.: "A Distributed Metaheuristic Approach for Complexity Reduction in Air Traffic for Strategic 4D Trajectory Optimization", In Proceedings of the 2020 International Conference on Artificial Intelligence and *
ZHI-JUN WU et al.: "A 4D Trajectory Prediction Model Based on the BP Neural Network", Journal of Intelligent System *
JI YULONG et al.: "Instrument Landing Simulation System for Aircraft", Journal of System Simulation *
SONG GE: "Terrain Modeling Method for Flight Simulation Based on Tessellation Shading", Advanced Engineering Sciences *
JIANG BO et al.: "Waypoint Flight Conflict Resolution Based on Deep Reinforcement Learning", Aeronautical Computing Technique *
WANG MIN et al.: "Design of a Flight Data Processing System Based on Cluster Shared Storage", Informatization Research *
XU WENJUN: "Research on Air Traffic Control Automation Systems and Data Fusion Methods", China Master's Theses Full-text Database, Engineering Science and Technology II *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113393495A (en) * 2021-06-21 2021-09-14 暨南大学 High-altitude parabolic track identification method based on reinforcement learning
CN113393495B (en) * 2021-06-21 2022-02-01 暨南大学 High-altitude parabolic track identification method based on reinforcement learning
CN114115304A (en) * 2021-10-26 2022-03-01 南京航空航天大学 Aircraft four-dimensional climbing track planning method and system
CN114141062A (en) * 2021-11-30 2022-03-04 中国电子科技集团公司第二十八研究所 Aircraft interval management decision method based on deep reinforcement learning
CN114141062B (en) * 2021-11-30 2022-11-01 中国电子科技集团公司第二十八研究所 Aircraft interval management decision method based on deep reinforcement learning
CN115524964A (en) * 2022-08-12 2022-12-27 中山大学 Rocket landing real-time robust guidance method and system based on reinforcement learning
CN115691231A (en) * 2023-01-03 2023-02-03 中国电子科技集团公司第二十八研究所 Method and system for simulation deduction and conflict resolution by using air plan

Also Published As

Publication number Publication date
CN112818599B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN112818599B (en) Air control method based on reinforcement learning and four-dimensional track
Zeng et al. A deep learning approach for aircraft trajectory prediction in terminal airspace
CN100591900C (en) Flight control system having a three control loop design
US20180261101A1 (en) Apparatus to generate aircraft intent and related methods
Razzaghi et al. A survey on reinforcement learning in aviation applications
Zhang et al. 3D path planning and real-time collision resolution of multirotor drone operations in complex urban low-altitude airspace
Dong et al. Deep learning in aircraft design, dynamics, and control: Review and prospects
Swierstra et al. Common trajectory prediction capability for decision support tools
Brittain et al. Autonomous separation assurance with deep multi-agent reinforcement learning
Dong et al. Study on the resolution of multi-aircraft flight conflicts based on an IDQN
Pham et al. A generative adversarial imitation learning approach for realistic aircraft taxi-speed modeling
Rodriguez-Sanz et al. 4D-trajectory time windows: definition and uncertainty management
Zijian et al. Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments
De Marco et al. A deep reinforcement learning control approach for high-performance aircraft
Başpınar et al. Optimization-based autonomous air traffic control for airspace capacity improvement
Li et al. A warm-started trajectory planner for fixed-wing unmanned aerial vehicle formation
Xie et al. Long and short term maneuver trajectory prediction of UCAV based on deep learning
Jiang et al. A deep reinforcement learning strategy for UAV autonomous landing on a platform
Başpinar et al. Mission planning and control of multi-aircraft systems with signal temporal logic specifications
CN113093568A (en) Airplane automatic driving operation simulation method based on long-time and short-time memory network
Keong et al. Reinforcement learning for autonomous aircraft avoidance
Lee et al. Predicting interactions between agents in agent-based modeling and simulation of sociotechnical systems
Zhu et al. Multi-constrained intelligent gliding guidance via optimal control and DQN
Xu et al. Reinforcement learning for autonomous morphing control and cooperative operations of UAV cluster
Konyak et al. A demonstration of an aircraft intent interchange specification for facilitating trajectory-based operations in the national airspace system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220614