CN114141062B - Aircraft interval management decision method based on deep reinforcement learning - Google Patents

Aircraft interval management decision method based on deep reinforcement learning Download PDF

Info

Publication number
CN114141062B
Authority
CN
China
Prior art keywords
flight
network
action
current
flights
Prior art date
Legal status
Active
Application number
CN202111443511.7A
Other languages
Chinese (zh)
Other versions
CN114141062A (en
Inventor
刘泽原
徐秋程
丁辉
史艳阳
吴靓浩
张婧婷
陈飞飞
徐珂
谈青青
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202111443511.7A priority Critical patent/CN114141062B/en
Publication of CN114141062A publication Critical patent/CN114141062A/en
Application granted granted Critical
Publication of CN114141062B publication Critical patent/CN114141062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G08G 5/04: Traffic control systems for aircraft, e.g. air-traffic control [ATC]; anti-collision systems
    • G08G 5/0073: Traffic control systems for aircraft; surveillance aids
    • G06N 3/044: Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/049: Computing arrangements based on biological models; neural networks; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/084: Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an aircraft interval management decision method based on deep reinforcement learning that achieves end-to-end control directly from input to output. The method applies deep reinforcement learning to the aviation field: a deep recurrent Q-network predicts and judges the traffic situation in the terminal area, and a trained flight speed regulation strategy enables autonomous maintenance of safe separation between flights in the terminal area. The method is used to deploy aircraft in the terminal area under busy operating conditions, achieving conflict resolution and sustained conflict-free operation of busy sectors, relieving the controller's separation decision workload in complex operating scenarios, and improving sector control efficiency and safety assurance capability.

Description

Aircraft interval management decision method based on deep reinforcement learning
Technical Field
The invention relates to the field of civil aviation air traffic control, in particular to an aircraft interval management decision method based on deep reinforcement learning.
Background
With the rapid development of air transportation, demand for daily travel and cargo transport has grown quickly, and a busy airport now handles more than 1,000 flights per day. Under such busy conditions, some efficiency must be sacrificed to keep aircraft safely separated, which brings other effects such as longer average flight times, flight paths deviating from the standard arrival and departure routes, and increased controller workload. Existing control systems can only detect short-term conflicts; they can neither predict medium- and long-term conflicts nor offer separation maintenance decision suggestions to controllers.
Reinforcement learning is an important branch of machine learning; its essence is to describe and solve the problem of an agent learning a policy that maximizes reward or achieves a specific goal while interacting with an environment. With the development of deep learning, reinforcement learning can use a neural network to extract and learn feature knowledge directly from raw input data and then learn a control policy from the extracted features with a conventional reinforcement learning algorithm, without manual or heuristic feature engineering. Such reinforcement learning combined with deep learning is called deep reinforcement learning. Using deep reinforcement learning to realize autonomous separation maintenance between aircraft therefore has practical significance for improving sector operation efficiency and reducing potential conflicts.
Disclosure of Invention
The purpose of the invention is as follows: to provide control decision suggestions for controllers, realize automatic aircraft conflict resolution and autonomous maintenance of safe flight separation, and improve sector operation efficiency in a high-density operation state.
Based on the current operational control automation system, the method acquires basic situation information such as the position, speed and heading of flights in a sector and, using information such as the sector's standard arrival and departure procedures and the aircraft wake-turbulence separation categories, issues speed and altitude adjustment instructions to flights with potential conflicts, thereby eliminating the potential conflicts, keeping the sector running normally under high-density operation, and reducing the controller's workload under such conditions.
To achieve this purpose, the invention provides an aircraft interval management decision method based on deep reinforcement learning that analyzes the situation in the current sector, provides speed regulation suggestions for flights that may conflict, and resolves the potential conflicts.
The invention comprises the following steps:
step 1: defining action and state space of a deep reinforcement learning environment for aircraft flight command;
step 2: constructing an interval fine decision deep reinforcement learning network of the aircraft;
step 3: training an interval fine decision deep reinforcement learning network of the aircraft;
step 4: realizing fine management of the aircraft interval through the aircraft interval fine decision deep reinforcement learning network.
The step 1 comprises the following steps:
the method comprises the steps that two deep reinforcement learning intelligent bodies are used for selecting an intelligent body and an action selecting intelligent body for a flight respectively, wherein the state space of the flight selecting intelligent body is position information, course information and model information of all controllable flights in a current sector, and the action space is that less than or equal to two flights are selected from all controllable flights in the current sector for maneuvering; the state space of the action selection agent selects the standby movable flight selected by the agent for the flight, the position information, the course information and the model information of the three flights closest to the standby movable flight, and the distance from the standby movable flight, and the action space is the maneuvering action of the current flight at the next moment.
The step 2 comprises the following steps:
the aircraft interval fine decision depth reinforcement learning network comprises a flight selection intelligent agent and an action selection intelligent agent, wherein the flight selection intelligent agent comprises a Target value calculation network Target Q and an action value calculation network Eval Q; the Target Q network is used for training the Eval Q network and judging the output of the Eval Q network;
the action selection intelligent body comprises a Target value calculation network Target Q, an action value calculation network EvalQ and two long-short term memory neural networks LSTM, wherein the Eval Q network is used for receiving the positions, the speeds and the heights of the standby movable flight and three flights nearest to the standby movable flight selected by the flight selection intelligent body at the current moment, outputting the action values Q of all optional actions of the standby movable flight at the current moment, selecting the action with the highest action value to execute, the Target Q network is used for training the Eval Q network and judging the output of the Eval Q network, and the LSTM network is used at the last ends of the Target Q network and the Eval Q network and is used for processing time sequence data generated by flight after the flight enters a sector to judge the future movement trend of the Target flight.
The step 3 comprises the following steps:
step 3.1: initializing the parameters of the deep reinforcement learning algorithm, including the total number of training rounds E, the feature dimension n_s of the state space, the dimension n_a of the action space, the step size α of each parameter update, the attenuation factor γ of the action value function, the action exploration rate ε, and the weight parameters of the Eval Q networks of the flight selection agent and the action selection agent; initializing the weight parameters of each Target Q network to be the same as its Eval Q network; initializing the soft update step size τ of the Target Q network, the number m of batch training samples, the size d of the experience replay pool, the replay data volume d_start at which training starts, and the Target Q network update round number c; initializing the simulation environment state;
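For concreteness, the hyperparameters listed in step 3.1 could be collected in a configuration object like the sketch below; every numeric value shown is a placeholder assumption, since the patent does not disclose the actual settings.

```python
from dataclasses import dataclass

@dataclass
class DQNConfig:
    """Hyperparameters of step 3.1; all values below are placeholders."""
    episodes: int = 10_000        # E, total training rounds
    state_dim: int = 64           # n_s, state feature dimension
    action_dim: int = 10          # n_a, action space dimension
    lr: float = 1e-3              # alpha, update step size
    gamma: float = 0.99           # attenuation (discount) factor
    epsilon: float = 0.99         # initial action exploration rate
    tau: float = 0.01             # soft update step of the Target Q network
    batch_size: int = 64          # m, batch training sample count
    replay_size: int = 100_000    # d, experience replay pool capacity
    replay_start: int = 1_000     # d_start, data volume before training starts
    target_update_every: int = 10 # c, rounds between Target Q updates
```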
step 3.2: receiving real-time ADS-B (automatic dependent surveillance-broadcast) data, screening all controllable flights in the current time period, and acquiring longitude, latitude, altitude, speed, heading and aircraft type information from the ADS-B data; normalizing the longitude, latitude, altitude, speed and heading information by scaling the data to the interval [0,1] to obtain normalized features, encoding the aircraft type information as a one-hot feature vector, and concatenating the normalized features with the one-hot feature vector to form the feature vector s_t^fca of the environment at the current moment;
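A minimal sketch of the normalization and one-hot encoding in step 3.2 follows; the value ranges used for scaling are assumed sector-specific bounds, not values given by the invention.

```python
import numpy as np

def normalize(value, lo, hi):
    """Min-max scaling to [0, 1]; lo/hi are assumed physical bounds."""
    return (value - lo) / (hi - lo)

def build_feature_vector(flight, type_vocab):
    """Normalized continuous features concatenated with a one-hot aircraft
    type code, as described in step 3.2. Bounds are illustrative assumptions."""
    continuous = np.array([
        normalize(flight["lon"],   118.0,  120.0),   # assumed sector longitude range
        normalize(flight["lat"],    31.0,   33.0),   # assumed sector latitude range
        normalize(flight["alt"],     0.0, 6000.0),   # metres
        normalize(flight["speed"],   0.0,  250.0),   # m/s
        normalize(flight["heading"], 0.0,  360.0),
    ])
    one_hot = np.zeros(len(type_vocab))
    one_hot[type_vocab.index(flight["type"])] = 1.0
    return np.concatenate([continuous, one_hot])
```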
Step 3.3: feature vector of current time of using environment
Figure BDA0003384363000000032
Obtaining flight selection agent action as input of Eval Q network of flight selection agent
Figure BDA0003384363000000033
Feature vector of environment at current moment
Figure BDA0003384363000000034
And
Figure BDA0003384363000000035
stitching to form a new eigenvector
Figure BDA0003384363000000036
And inputting the motion selection intelligent agent into the Eval Q network to obtain the motion of the motion selection intelligent agent
Figure BDA0003384363000000037
step 3.4: executing a_t^faa in the simulation environment, obtaining the new longitude, latitude, altitude, heading and speed information from the ADS-B data after waiting for the flight to execute the action, and forming the feature vector s_{t+1} of the next moment by the method of step 3.2;
step 3.5: calculating the reward function r_fca of the flight selection agent and the inbound-flight altitude descent reward function r_faa of the action selection agent, judging whether the current training round needs to end to obtain the end identifier is_end_t, and storing the experience tuples (s_t^fca, a_t^fca, r_fca, s_{t+1}^fca, is_end_t) and (s_t^faa, a_t^faa, r_faa, s_{t+1}^faa, is_end_t) into the respective experience replay pools;
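The experience replay pools used in step 3.5 can be sketched as a simple bounded buffer; the uniform sampling and deque-based eviction below are standard choices assumed for illustration.

```python
import random
from collections import deque

class ReplayPool:
    """Minimal experience replay pool: stores (s, a, r, s_next, is_end)
    tuples, drops the oldest data when full, and samples uniformly."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries evicted first

    def add(self, state, action, reward, next_state, is_end):
        self.buffer.append((state, action, reward, next_state, is_end))

    def sample(self, m):
        return random.sample(self.buffer, m)

    def __len__(self):
        return len(self.buffer)

# Each agent keeps its own pool; training starts once len(pool) >= d_start.
```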
step 3.6: if the current simulation round is not finished, repeating steps 3.2 to 3.8, and if the current simulation round is finished, starting the next round of simulation;
step 3.7: when the amount of data in the experience replay pool is greater than or equal to d_start, starting the training process;
step 3.8: if the current number of training rounds is an integer multiple of c, updating the weight parameters of the Target Q network with the soft update strategy.
Step 3.2 comprises: using s_t^fca as the input of the Eval Q network of the flight selection agent and outputting the action value set Q_t corresponding to all actions.
Step 3.3 includes: inputting the feature vector s_t^fca of the environment at the current moment into the Eval Q network of the flight selection agent and calculating the action value set Q_t of the currently selectable flights by forward propagation, while generating a random number n_random in the interval [0,1]. Here ε ∈ (0,1) is the action exploration rate; its initial value is 0.99, it is multiplied by a decay factor of 0.95 at the end of each training round, and once ε < 0.1 it is set to 0. If n_random < ε, an action is selected at random from the action space, i.e. one flight or two flights are randomly selected as the target flights; otherwise the flight in the sector with the largest action value is selected as the target flight and stored as a_t^fca. The feature vector s_t^fca of the environment at the current moment is then concatenated with a_t^fca to form the new feature vector s_t^faa and input into the Eval Q network of the action selection agent, where the value of each maneuver is calculated by forward propagation and a random number n_random in [0,1] is generated. If n_random < ε, a maneuver is selected at random from the action space; otherwise the maneuver with the largest action value is selected as the current maneuver and stored as a_t^faa.
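The ε-greedy selection and the exploration schedule just described (initial ε of 0.99, multiplied by 0.95 after each round, forced to 0 once it falls below 0.1) can be sketched as follows; the function names are illustrative.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon, n_actions):
    """Pick a random action with probability epsilon, otherwise the argmax."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(q_values))

def decay_epsilon(epsilon, decay=0.95, floor=0.1):
    """Schedule of step 3.3: start at 0.99, multiply by 0.95 after each
    training round, and switch to 0 once epsilon falls below 0.1."""
    epsilon *= decay
    return 0.0 if epsilon < floor else epsilon
```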
Step 3.5 comprises: calculating the reward function r_fca of the flight selection agent and the inbound-flight altitude descent reward function r_faa of the action selection agent with the following formulas:

r_fca = 1000 / dis_flights + 10000 / dis_airport        (1)

[Formula (2), which defines r_faa, is given in the original only as an image.]
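A sketch of the reward computation follows. Formula (1) is implemented as written; since formula (2) is not reproduced in the text, the r_faa function below is a hypothetical stand-in, and the reading of dis_flights and dis_airport as the separation to the nearest aircraft and the distance to the airport is likewise an assumption.

```python
def flight_selection_reward(dis_flights, dis_airport):
    """Formula (1): r_fca = 1000 / dis_flights + 10000 / dis_airport.
    dis_flights and dis_airport are assumed to be the distance to the
    nearest aircraft and to the airport, in consistent units."""
    return 1000.0 / dis_flights + 10000.0 / dis_airport

def inbound_descent_reward(altitude, target_altitude):
    """Hypothetical stand-in for r_faa (formula (2) is only available as an
    image in the source): a shaping term that rewards reducing the altitude
    error of an inbound flight."""
    return -abs(altitude - target_altitude) / 1000.0
```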
Step 3.5 further comprises: judging whether the current training round needs to end. For flight i, x_i, y_i, h_i respectively represent the flight's current longitude, latitude and altitude, and x_i^target, y_i^target, h_i^target respectively represent the longitude, latitude and altitude of its destination point. The round ends if one of the following four conditions occurs:
(1) all flights have arrived at their destination points (formula (3));
(2) a flight has exceeded the sector boundary, where sector represents the sector longitude/latitude boundary and h_min, h_max respectively represent the lower and upper bounds of the sector altitude (formula (4));
(3) the sector handover condition is not satisfied (formula (5));
(4) two flights have collided, where δ represents the distance safety threshold (formula (6)).
If one of the above four conditions occurs, the end identifier is_end_t variable is stored as the true value True; otherwise it is stored as the false value False.
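The four end-of-round conditions can be checked as in the sketch below. Because formulas (3) to (6) are only available as images, the arrival tolerance, the sector-containment test (a sector object with a contains() method) and the pairwise conflict test are assumptions.

```python
import math

def round_finished(flights, sector, h_min, h_max, delta, arrival_tol):
    """End-of-round test covering the four cases of step 3.5; numeric
    tolerances and the sector.contains() interface are assumed."""
    # (1) every flight has reached its destination point
    all_arrived = all(
        math.dist((f["lon"], f["lat"], f["alt"]),
                  (f["target_lon"], f["target_lat"], f["target_alt"])) < arrival_tol
        for f in flights)

    # (2) some flight has left the sector laterally or vertically
    out_of_sector = any(
        not sector.contains(f["lon"], f["lat"]) or not (h_min <= f["alt"] <= h_max)
        for f in flights)

    # (3) sector handover condition not satisfied (flag assumed to be provided)
    handover_failed = any(not f.get("handover_ok", True) for f in flights)

    # (4) two flights closer than the distance safety threshold delta
    conflict = any(
        math.dist((a["lon"], a["lat"]), (b["lon"], b["lat"])) < delta
        for i, a in enumerate(flights) for b in flights[i + 1:])

    return all_arrived or out_of_sector or handover_failed or conflict
```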
The step 4 comprises the following steps:
step 4.1: establishing network connection with a control automation system, interacting with the control automation system in a message middleware communication mode, extracting current control sector structure information, selecting a corresponding trained model according to the current control sector structure information, and reloading trained model parameters;
step 4.2: receiving the longitude, latitude, altitude, speed, heading and aircraft type information of the flights in the current sector from the control automation system, normalizing these data and concatenating them into a flight feature vector used as the input of the flight selection agent; the flight selection agent selects the flight to be adjusted according to the current situation, the selection is concatenated with the flight information feature vector as the input of the action selection agent, and the action selection agent selects the maneuver action to be executed by the current flight and generates the corresponding control instruction;
step 4.3: synthesizing the control instruction into control voice, pushing it into the control automation system through message middleware, and displaying it on the control interface; monitoring the pilot's execution in real time through the situation returned by the control automation system, resending the instruction when it is executed inconsistently or not executed, and cancelling command of the flight after it has been guided to the handover point.
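Putting step 4 together, the online decision loop could look like the sketch below. The get_sector_flights, push_instruction and choose interfaces are hypothetical placeholders, not the control automation system's real API, and the four-second polling interval follows the detailed embodiment later in this document.

```python
import time

def decision_loop(flight_selector, action_selector, atc_client):
    """Online use of the trained agents (step 4); all interface names here
    are assumed placeholders."""
    while True:
        flights = atc_client.get_sector_flights()              # lon/lat/alt/speed/heading/type
        if flights:
            candidate = flight_selector.choose(flights)         # flight that should maneuver
            maneuver = action_selector.choose(flights, candidate)  # e.g. "descend", "decelerate"
            instruction = f"{candidate['callsign']}: {maneuver}"
            atc_client.push_instruction(instruction)            # synthesized into controller voice
        time.sleep(4)                                           # surveillance data arrives every ~4 s
```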
The invention has the following beneficial effects:
1. improving sector operation efficiency under busy conditions
Taking current approach control as an example, when the number of flights in a sector reaches a certain value and a potential conflict cannot be resolved, the controller will usually resort to holding patterns, and in extreme cases multiple holds can occur. The method can resolve the potential conflicts without resorting to holding, which improves sector capacity and operating efficiency, reduces the average flight time of flights, saves fuel, and lightens the controller's workload.
2. Reducing controller workload
The method provides control decision suggestions to the controller, who only needs to judge, according to the current situation in the sector, whether a suggestion should be executed, thereby reducing the controller's workload in busy operating conditions.
3. Improving 4D trajectory prediction accuracy and facilitating continuous descent/continuous climb operations
At present, one reason that 4D trajectory prediction is inaccurate is that the path of a flight in the approach sector is affected by factors such as the control strategy, so the flight cannot strictly follow the standard arrival and departure procedures and its path within the approach sector cannot be predicted. With this method, aircraft can avoid conflicts while following the standard arrival and departure procedures in the approach sector. Furthermore, when the standard arrival and departure procedures are designed with continuous descent/continuous climb in mind, the method facilitates the operation of continuous descent/continuous climb procedures.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of a method for fine management of aircraft separation based on deep reinforcement learning.
Fig. 2 is a schematic diagram of a network architecture.
Fig. 3 is a structural diagram of the Nanjing approach AP01 sector.
Fig. 4 is the command altitude profile.
Detailed Description
The application scenario of the method is the approach control positions of a terminal control area, which must have access to ADS-B data; the method is trained specifically for each control area structure and runway operating mode. When these factors change, for example when the position configuration changes, the sector structure is adjusted, or the runway operating mode is switched, the trained model corresponding to the new configuration must be loaded before computation. The invention comprises the following steps:
1. defining action and state space of a deep reinforcement learning environment for aircraft flight command;
the invention uses two deep reinforcement learning agents, which respectively select agents for flights and agents for actions. The state space of the flight selection agent is position information (longitude, latitude and height), course information and machine type information of all controllable flights in the current sector, and the action space is that less than or equal to two flights are selected from all controllable flights in the current sector for maneuvering; the state space of the action selection agent selects the standby flight selected by the agent for the flight and the position information (longitude, latitude and height) of three flights nearest to the flight, the course information, the model information and the distance from the flight, and the action space is the maneuvering action of the next moment of the current flight, such as descending and ascending of the altitude, acceleration and deceleration of the speed and the like.
2. Method for constructing interval fine decision deep reinforcement learning network of aircraft
The deep reinforcement learning network comprises a flight selection agent and an action selection agent. The flight selection agent consists of a Target Q network and an Eval Q network; the Eval Q network receives the environment state at the current moment and outputs the action values Q of all selectable actions at the current moment, and the action with the highest value is selected for execution, while the Target Q network is used to train the Eval Q network and to evaluate its output. The action selection agent consists of a Target Q network, an Eval Q network and two LSTM networks; its Eval Q network receives the information of the candidate flight chosen by the flight selection agent at the current moment and the positions, speeds and altitudes of the three flights nearest to it, and outputs the action values Q of all selectable actions of that flight at the current moment, the action with the highest value being selected for execution; the Target Q network is used to train the Eval Q network and to evaluate its output, and the LSTM networks sit at the final stage of both the Target Q and Eval Q networks to process time-series data and judge the future movement trend of the target flight.
3. Deep reinforcement learning network for training interval fine decision of aircraft
The training steps of the aircraft interval fine decision deep reinforcement learning network are as follows
Step 3.1: Initializing the parameters of the deep reinforcement learning algorithm, including the total number of training rounds E, the feature dimension n_s of the state space, the dimension n_a of the action space, the step size α of each parameter update, the attenuation factor γ of the action value function, the action exploration rate ε and the weight parameters of the Eval Q network; initializing the weight parameters of the Target Q network to be the same as the Eval Q network; initializing the soft update step size τ of the Target Q network, the number m of batch training samples, the size d of the experience replay pool, the replay data volume d_start at which training starts, and the Target Q network update round number c. The simulation environment state is initialized.
Step 3.2: Processing the input data: receiving real-time ADS-B data, screening all controllable flights in the current time period, and acquiring longitude, latitude, altitude, speed, heading, aircraft type and other information from the ADS-B data. The longitude, latitude, altitude, speed and heading information are normalized according to their possible maximum and minimum values, scaling the data to the interval [0,1]; the aircraft type information is one-hot encoded, and the normalized features are concatenated with the one-hot feature vector to form the feature vector s_t^fca of the environment at the current moment.
Step 3.3: Inputting the feature vector of the environment at the current moment into the Eval Q network of the flight selection agent and calculating the action value of each currently selectable flight by forward propagation, while generating a random number n_random in the interval [0,1]. If n_random < ε, an action is selected at random from the action space, i.e. one or two flights are randomly chosen as the target flights; otherwise the flight in the sector with the largest action value is selected as the target flight and stored as a_t^fca. The environment feature vector s_t^fca is concatenated with a_t^fca to form a new feature vector s_t^faa and input into the Eval Q network of the action selection agent, where the value of each maneuver is calculated by forward propagation, while generating a random number n_random in [0,1]. If n_random < ε, a maneuver is selected at random from the action space; otherwise the maneuver with the largest action value is selected as the current maneuver and stored as a_t^faa.
Step 3.4: Applying a_t^faa in the simulation environment; after the flight has executed the action, the new longitude, latitude, altitude, heading, speed and other data are acquired from the ADS-B data to form the feature vector s_{t+1}^fca, and the feature vector s_{t+1}^faa is generated according to step 3.2.
step 3.5: training reinforcement learning network calculates flight selection reward according to formulas (1) and (2)
Figure BDA0003384363000000085
And action selection rewards
Figure BDA0003384363000000086
rfca=1000/disflights+10000/disairport (1)
Figure BDA0003384363000000087
Step 3.6: Judging whether the current training round needs to end. For flight i, x_i, y_i, h_i represent the current longitude, latitude and altitude, and x_i^target, y_i^target, h_i^target represent the longitude, latitude and altitude of the destination point. The round ends if all flights have arrived at their destination points (formula (3)); or a flight has exceeded the sector boundary, i.e. the longitude/latitude boundary or the altitude boundary (formula (4)); or the sector handover condition is not satisfied (formula (5)); or two flights have come into conflict (formula (6)). If one of the above four conditions occurs, the is_end_t variable is stored as True; otherwise it is stored as False.
Step 3.7: Storing (s_t^fca, a_t^fca, r_fca, s_{t+1}^fca, is_end_t) and (s_t^faa, a_t^faa, r_faa, s_{t+1}^faa, is_end_t) into the respective experience replay pools; while the amount of data in the experience replay pools is less than d_start, steps 3.2 to 3.8 are repeated.
Step 3.7.1: When the amount of data in the experience replay pool is greater than d_start, m samples are sampled from the experience replay set.
Step 3.7.2: The target Q value of each sample is calculated as shown in formula (7); when the amount of data in the experience replay pool is greater than d, the oldest data is deleted with each new addition:

y_j = r_j,  if is_end_j is True
y_j = r_j + γ·Q'(s_{j+1}, argmax_a Q(s_{j+1}, a; w); w'),  otherwise        (7)

Step 3.7.3: The mean square error loss of all samples is calculated, and the weight parameters of the Eval Q network are updated by back-propagating the gradient; the loss is calculated as:

loss = (1/m) · Σ_{j=1..m} (y_j - Q(s_j, a_j; w))²        (8)
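A PyTorch sketch of one training step implementing the target of formula (7) and the mean square error loss of formula (8) follows. It assumes that eval_q and target_q map a batch of states to Q-values and that the batch tensors are already stacked; these, and the tensor shapes, are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_update(eval_q, target_q, optimizer, batch, gamma):
    """One training step: the Target Q network evaluates the action chosen
    by the Eval Q network, as described for formula (7)."""
    states, actions, rewards, next_states, is_end = batch   # pre-stacked tensors

    with torch.no_grad():
        next_actions = eval_q(next_states).argmax(dim=1, keepdim=True)          # Eval Q picks
        next_values = target_q(next_states).gather(1, next_actions).squeeze(1)  # Target Q evaluates
        y = rewards + gamma * next_values * (1.0 - is_end)  # y_j = r_j when the episode ended

    q_taken = eval_q(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_taken, y)                            # formula (8): mean squared error

    optimizer.zero_grad()
    loss.backward()                                          # back-propagate to update Eval Q
    optimizer.step()
    return loss.item()
```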
step 3.8: and if the current training round number is an integral multiple of c, updating the weight parameters of the flight selection agent and the action selection agent Target Q network by adopting a soft updating strategy, wherein w' represents the weight parameters of the Target Q network, and w represents the weight parameters of the Eval Q network. The formula for the soft update strategy is as follows.
w′=τw+(1-τ)w′ (9)
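Formula (9) corresponds to the following parameter-wise soft update (PyTorch modules are assumed):

```python
def soft_update(target_q, eval_q, tau):
    """Formula (9): w' <- tau * w + (1 - tau) * w', applied to every
    parameter of the Target Q network."""
    for w_prime, w in zip(target_q.parameters(), eval_q.parameters()):
        w_prime.data.copy_(tau * w.data + (1.0 - tau) * w_prime.data)
```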
4. Aircraft interval fine decision deep reinforcement learning network for realizing aircraft interval fine management
Step 4.1: After model training is complete, the system establishes a network connection with the control automation system and interacts with it through message middleware. First, the current control sector's structure information, standard arrival and departure routes, runway operating mode, airspace restriction information and so on are extracted; the corresponding trained model is selected according to this information and its trained parameters are reloaded.
Step 4.2: The longitude, latitude, altitude, speed, heading, aircraft type and other information of the flights in the current sector are received from the control automation system, normalized and concatenated into a feature vector that serves as the input of the flight selection agent. The flight selection agent selects the flight to be adjusted according to the current situation; this selection is concatenated with the flight information feature vector as the input of the action selection agent, and the action selection agent selects the maneuver the current flight should execute and generates the corresponding control instruction.
Step 4.3: The suggested control instruction is synthesized into control voice, pushed into the control automation system through message middleware, and displayed on the control interface. The pilot's execution is monitored in real time through the situation returned by the control automation system; the instruction is resent when it is executed inconsistently or not executed, and command of the flight is cancelled once it has been guided to the handover point. Data are obtained from the control automation system every four seconds; each time the data are received, the flight that currently needs to be commanded and the specific action instruction to be issued are calculated, and the state after executing the instruction is evaluated. When potentially conflicting flights may exist in the sector, they are marked on the situation display and set as the current flights to command, so that the potential conflict is eliminated in time.
Examples
The overall process of the present invention is shown in FIG. 1. The invention provides an aircraft interval management decision method based on deep reinforcement learning, which comprises the following steps:
1. defining action and state space of a deep reinforcement learning environment for aircraft flight command;
the invention uses two deep reinforcement learning agents, which respectively select agents for flights and agents for actions. The state space of the flight selection agent is position information (longitude, latitude and height), course information and machine type information of all controllable flights in the current sector, and the action space is that less than or equal to two flights are selected from all controllable flights in the current sector for maneuvering; the state space of the action selection agent selects the standby flight selected by the agent for the flight and the position information (longitude, latitude and altitude), the course information, the model information and the distance from the flight for the three flights closest to the flight, and the action space is the maneuvering action at the next moment of the current flight, such as descending and ascending of the altitude, acceleration and deceleration of the speed and the like.
2. Constructing an aircraft interval fine decision deep reinforcement learning network
The aircraft flight command deep reinforcement learning agent comprises a flight selection agent and an action selection agent. The flight selection agent consists of a Target Q network and an Eval Q network; the Eval Q network receives the environment state at the current moment and outputs the action values Q of all selectable actions at the current moment, and the action with the highest value is selected for execution, while the Target Q network is used to train the Eval Q network and to evaluate its output. The action selection agent consists of a Target Q network, an Eval Q network and two LSTM networks; its Eval Q network receives the information of the candidate flight chosen by the flight selection agent at the current moment and the positions, speeds and altitudes of the three flights nearest to it, and outputs the action values Q of all selectable actions of that flight at the current moment, the action with the highest value being selected for execution; the Target Q network is used to train the Eval Q network and to evaluate its output, and the LSTM networks sit at the final stage of both the Target Q and Eval Q networks to process time-series data and judge the future movement trend of the target flight. The algorithm structure is shown in Fig. 2.
3. Deep reinforcement learning network for training interval fine decision of aircraft
The training strategy flows of the flight selection agent and the action selection agent are basically the same, so that the flight selection agent is taken as an example to explain the training flow of aircraft flight command and interval deployment deep reinforcement learning.
Step 3.1: Initializing the parameters of the deep reinforcement learning algorithm, including the total number of training rounds E, the feature dimension n_s of the state space, the dimension n_a of the action space, the step size α of each parameter update, the attenuation factor γ of the action value function, the action exploration rate ε and the weight parameters of the Eval Q network; initializing the weight parameters of the Target Q network to be the same as the Eval Q network; initializing the soft update step size τ of the Target Q network, the number m of batch training samples, the size d of the experience replay pool, the replay data volume d_start at which training starts, and the Target Q network update round number c. The simulation environment state is initialized.
Step 3.2: Acquiring longitude, latitude, altitude, speed, heading, aircraft type and other information from the ADS-B data. The longitude, latitude, altitude, speed and heading information are normalized according to their possible maximum and minimum values, scaling the data to the interval [0,1]; the aircraft type information is one-hot encoded, and the normalized features are concatenated with the one-hot feature vector to form the feature vector s_t^fca of the environment at the current moment.
Step 3.3: Inputting the feature vector of the environment at the current moment into the Eval Q network of the flight selection agent and calculating the action value of each currently selectable flight by forward propagation, while generating a random number n_random in the interval [0,1]. If n_random < ε, an action is selected at random from the action space, i.e. one or two flights are randomly chosen as the target flights; otherwise the flight in the sector with the largest action value is selected as the target flight and stored as a_t^fca. The environment feature vector s_t^fca is concatenated with a_t^fca to form a new feature vector s_t^faa and input into the Eval Q network of the action selection agent, where the value of each maneuver is calculated by forward propagation, while generating a random number n_random in [0,1]. If n_random < ε, a maneuver is selected at random from the action space; otherwise the maneuver with the largest action value is selected as the current maneuver and stored as a_t^faa.
Step 3.4: Applying a_t^faa in the simulation environment; after the flight has executed the action, the new longitude, latitude, altitude, heading, speed and other data are acquired from the ADS-B data to form the feature vector s_{t+1}^fca, and the feature vector s_{t+1}^faa is regenerated according to step 3.2.
step 3.5: prize rtThe calculation method of (2) uses a reward shaping method, namely, a smaller reward value is returned for each non-key action to solve the problem of sparse reward distribution so as to accelerate the training speed.Reward function r for selecting agents on flightsfcaAnd action selection agent's inbound flight altitude descent reward function rfaaFor example, the calculation of the reward is as follows:
rfca=1000/disflights+10000/disairport (1)
Figure BDA0003384363000000121
step 3.6: judging whether the current training needs to be finished or not, and using x for flight ii,yi,hiIndicating the current longitude, latitude, altitude,
Figure BDA0003384363000000122
representing the longitude, latitude, altitude of the destination point, if all flights arrive at the destination point, then:
(1) All flights arrive at the target point, namely, the incoming flights successfully land at the airport, and the outgoing flights arrive at the corridor intersection. For flight i, use xi,yi,hiIndicating the current longitude, latitude, altitude,
Figure BDA0003384363000000123
representing the longitude, latitude, and altitude of the target point, the condition can be expressed as:
Figure BDA0003384363000000124
(2) A flight that exceeds a sector boundary (latitude and longitude boundary or altitude boundary) may be represented as:
Figure BDA0003384363000000125
(3) The sector handover condition is not satisfied, which can be expressed as:
Figure BDA0003384363000000126
(4) The conflict occurs between two flights, and the situation can be expressed as follows:
Figure BDA0003384363000000127
step 3.7: will be provided with
Figure BDA0003384363000000128
And
Figure BDA0003384363000000129
storing the data into respective experience playback pools, when the data amount in the experience playback pools is less than dstartAnd (5) repeating the step 3.2 to the step 3.8.
Step 3.7.1: sampling m samples from an empirical playback set
Figure BDA00033843630000001210
Step 3.7.2: calculating a target Q value y for each samplejIf the simulation is finished, the Target Q value is the reward returned by the simulation environment at the end, otherwise, the Target Q value is the reward of the simulation environment plus the estimation of the attenuated Target Q network on the action of the EvalQ network:
Figure BDA00033843630000001211
step 3.7.3: calculating the mean square error loss (loss) of all samples, and updating the weight parameters of the Eval Q network through inverse gradient propagation, wherein the calculation formula of the loss is as follows:
Figure BDA0003384363000000131
step 3.8: and if the number of the current training rounds is integral multiple of c, updating the weight parameter of the Target Q network by adopting a soft updating strategy, representing the weight parameter of the Target Q network by using w', and representing the weight parameter of the Eval Q network by using w.
The formula of the soft update strategy is as follows
w′=τw+(1-τ)w′ (9)
4. Aircraft interval fine management is realized through an aircraft interval fine decision deep reinforcement learning network
Step 4.1: After model training is complete, the system establishes a network connection with the control automation system and interacts with it through message middleware. First, the current control sector's structure information, standard arrival and departure routes, runway operating mode, airspace restriction information and so on are extracted, and the corresponding trained model is selected according to this information; for example, when controlling the Nanjing approach AP01 sector, the model trained in the simulation environment of east-flow operation of that sector is selected and its trained parameters are reloaded.
Step 4.2: The longitude, latitude, altitude, speed, heading, aircraft type and other information of the flights in the current sector are received from the control automation system, normalized and concatenated into a feature vector that serves as the input of the flight selection agent. The flight selection agent selects the flight to be adjusted according to the current situation; this selection is concatenated with the flight information feature vector as the input of the action selection agent, and the action selection agent selects the maneuver the current flight should execute, such as an altitude climb or descent instruction or a speed increase or decrease instruction, and generates the corresponding control instruction, e.g. "China Eastern 5254, descend to 2700 and maintain".
Step 4.3: The suggested control instruction is synthesized into control voice, pushed into the control automation system through message middleware, and displayed on the control interface. The pilot's execution is monitored in real time through the situation returned by the control automation system; the instruction is resent when it is executed inconsistently or not executed, and command of the flight is cancelled once it has been guided to the handover point. Data are obtained from the control automation system every four seconds; each time the data are received, the flight that currently needs to be commanded and the specific action instruction to be issued are calculated, and the state after executing the instruction is evaluated; when potentially conflicting flights may exist in the sector, they are marked on the situation display and set as the current flights to command, so that the potential conflict is eliminated in time. The method was verified in the Nanjing approach AP01 sector by generating historical flight flows in a simulation system, realizing arrival and departure command and control of multiple flights; the structure of the Nanjing approach AP01 sector is shown in Fig. 3, and the command altitude profile is shown in Fig. 4.
The present invention provides an aircraft interval management decision method based on deep reinforcement learning; there are many specific ways to implement the technical solution, and the above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with existing technology.

Claims (6)

1. An aircraft interval management decision method based on deep reinforcement learning is characterized by comprising the following steps:
step 1: defining an action space and a state space of a deep reinforcement learning environment for aircraft flight command;
step 2: constructing an interval fine decision deep reinforcement learning network of the aircraft;
step 3: training an interval fine decision deep reinforcement learning network of the aircraft;
step 4: realizing the fine management of the interval of the aircraft through the interval fine decision deep reinforcement learning network of the aircraft;
the step 1 comprises the following steps:
the method comprises using two deep reinforcement learning agents, namely a flight selection agent and an action selection agent, wherein the state space of the flight selection agent is the position information, heading information and aircraft type information of all controllable flights in the current sector, and its action space is selecting at most two flights from all controllable flights in the current sector to maneuver; the state space of the action selection agent is the candidate flight selected by the flight selection agent, the position information, heading information and aircraft type information of the three flights closest to the candidate flight and their distances from the candidate flight, and its action space is the maneuvering action of the current flight at the next moment;
the step 2 comprises the following steps:
the aircraft interval fine decision deep reinforcement learning network comprises a flight selection agent and an action selection agent, wherein the flight selection agent comprises a target value calculation network Target Q and an action value calculation network Eval Q; the Target Q network is used for training the Eval Q network and evaluating the output of the Eval Q network;
the action selection agent comprises a target value calculation network Target Q, an action value calculation network Eval Q and two long short-term memory neural networks LSTM, wherein the Eval Q network is used for receiving the positions, speeds and altitudes of the candidate flight selected by the flight selection agent at the current moment and of the three flights nearest to the candidate flight, outputting the action values Q of all selectable actions of the candidate flight at the current moment, and selecting the action with the highest action value for execution; the Target Q network is used for training the Eval Q network and evaluating the output of the Eval Q network, and the LSTM networks are used at the final stage of the Target Q network and the Eval Q network to process the time-series data generated after the flight enters the sector and judge the future movement trend of the target flight;
the step 3 comprises the following steps:
step 3.1: initializing the parameters of the deep reinforcement learning algorithm, including the total number of training rounds E, the feature dimension n_s of the state space, the dimension n_a of the action space, the step size α of each parameter update, the attenuation factor γ of the action value function, the action exploration rate ε, and the weight parameters of the Eval Q networks of the flight selection agent and the action selection agent; initializing the weight parameters of each Target Q network to be the same as its Eval Q network; initializing the soft update step size τ of the Target Q network, the number m of batch training samples, the size d of the experience replay pool, the replay data volume d_start at which training starts, and the Target Q network update round number c; initializing the simulation environment state;
step 3.2: receiving real-time ADS-B (automatic dependent surveillance-broadcast) data, screening all controllable flights in the current time period, and acquiring longitude, latitude, altitude, speed, heading and aircraft type information from the ADS-B data; normalizing the longitude, latitude, altitude, speed and heading information by scaling the data to the interval [0,1] to obtain normalized features, encoding the aircraft type information as a one-hot feature vector, and concatenating the normalized features with the one-hot feature vector to form the feature vector s_t^fca of the environment at the current moment;
Step 3.3: feature vector of current time of using environment
Figure FDA0003834062470000022
Obtaining flight selection agent actions as input of Eval Q network of flight selection agent
Figure FDA0003834062470000023
Feature vector of environment at current moment
Figure FDA0003834062470000024
And
Figure FDA0003834062470000025
stitching to form a new eigenvector
Figure FDA0003834062470000026
And inputting the motion selection intelligent agent into the Eval Q network to obtain the motion of the motion selection intelligent agent
Figure FDA0003834062470000027
step 3.4: executing a_t^faa in the simulation environment, obtaining the new longitude, latitude, altitude, heading and speed information from the ADS-B data after waiting for the flight to execute the action, and forming the feature vector s_{t+1} of the next moment;
step 3.5: calculating the reward function r_fca of the flight selection agent and the inbound-flight altitude descent reward function r_faa of the action selection agent, judging whether the current training needs to end to obtain the end identifier is_end_t, and storing the experience tuples (s_t^fca, a_t^fca, r_fca, s_{t+1}^fca, is_end_t) and (s_t^faa, a_t^faa, r_faa, s_{t+1}^faa, is_end_t) into the respective experience replay pools;
step 3.6: if the current simulation is not finished, repeating the steps from 3.2 to 3.8, and if the current simulation is finished, starting the next round of simulation;
step 3.7: when the amount of data in the experience replay pool is greater than or equal to d_start, starting the training process;
step 3.8: if the current number of training rounds is an integer multiple of c, updating the weight parameters of the Target Q network with the soft update strategy.
2. The method according to claim 1, characterized in that step 3.2 comprises: using s_t^fca as the input of the Eval Q network of the flight selection agent and outputting the action value set Q_t corresponding to all actions.
3. The method according to claim 2, characterized in that step 3.3 comprises: inputting the feature vector s_t^fca of the environment at the current moment into the Eval Q network of the flight selection agent and calculating the action value set Q_t of the currently selectable flights by forward propagation, while generating a random number n_random in the interval [0,1]; ε is the action exploration rate with a value between 0 and 1, ε is multiplied by a decay factor at the end of each training round, and when ε < 0.1, ε is set to 0; if n_random < ε, an action is selected at random from the action space, i.e. one flight or two flights are randomly selected as target flights; otherwise the flight in the sector with the largest action value is selected as the target flight and stored as a_t^fca; the feature vector s_t^fca of the environment at the current moment is concatenated with a_t^fca to form a new feature vector s_t^faa and input into the Eval Q network of the action selection agent, in which the value of each maneuver is calculated by forward propagation and a random number n_random in [0,1] is generated; if n_random < ε, a maneuver is selected at random from the action space; otherwise the maneuver with the largest action value is selected as the current maneuver and stored as a_t^faa.
4. A method according to claim 3, characterised in that step 3.5 comprises: calculating the reward function r_fca of the flight selection agent and the inbound-flight altitude descent reward function r_faa of the action selection agent with the following formulas:

r_fca = 1000 / dis_flights + 10000 / dis_airport        (1)

[Formula (2), which defines r_faa, is given in the original only as an image.]
5. The method of claim 4, wherein step 3.5 further comprises: judging whether the current training needs to end; for flight i, x_i, y_i, h_i respectively represent the flight's current longitude, latitude and altitude, and x_i^target, y_i^target, h_i^target respectively represent the longitude, latitude and altitude of the target point; training ends in any of the following cases:
(1) All flights have arrived at their destination points, that is, all inbound flights have successfully descended to the airport and all outbound flights have reached the corridor exit (formula (3)).
(2) A flight has exceeded the sector boundary, where sector represents the sector longitude/latitude boundary and h_min, h_max respectively represent the lower and upper bounds of the sector altitude (formula (4)).
(3) The sector handover condition is not satisfied (formula (5)).
(4) Two flights have collided, where δ represents the distance safety threshold (formula (6)).
If one of the above four cases occurs, the end identifier is_end_t variable is stored as the true value True; otherwise it is stored as the false value False.
6. The method of claim 5, wherein step 4 comprises:
Step 4.1: establishing a network connection with the control automation system, interacting with it through message-middleware communication, extracting the current control sector structure information, selecting the corresponding trained model according to that structure, and loading the trained model parameters;
Step 4.2: receiving the longitude/latitude, altitude, speed, heading and aircraft-type information of the flights in the current sector from the control automation system, normalizing it and concatenating it into a flight feature vector used as input to the flight selection agent; the flight selection agent selects the flight to be regulated according to the current situation, the selection is concatenated with the flight information feature vector as input to the action selection agent, the action selection agent selects the maneuver to be executed by the current flight, and the corresponding control instruction is generated;
Step 4.3: synthesizing the control instruction into control voice, pushing the voice to the control automation system through the message middleware, and displaying it on the control interface; monitoring the pilot's execution in real time through the situation returned by the control automation system, resending the instruction when it is executed inconsistently or not executed, and cancelling command of the flight once it has been directed to the transit point.
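To make the deployment steps of claim 6 easier to follow, here is a schematic decision cycle. The maneuver vocabulary, the instruction wording and the callables fca_model, asa_model and push_voice stand in for the trained networks, the speech synthesis and the middleware interface; none of them are taken from the patent.

```python
import numpy as np

def build_instruction(callsign: str, maneuver: int) -> str:
    """Hypothetical mapping from a maneuver index to a spoken control instruction."""
    phrases = {0: "maintain present speed", 1: "reduce speed by 20 km/h",
               2: "descend 300 meters", 3: "climb 300 meters"}
    return f"{callsign}, {phrases.get(maneuver, 'standby')}"

def decision_cycle(callsigns, flight_features, fca_model, asa_model, push_voice):
    """One online cycle: pick the flight to regulate, pick its maneuver,
    then synthesize and push the instruction. Greedy selection (no exploration) at deployment."""
    state = np.concatenate(flight_features)                    # normalized lon/lat, altitude, speed, heading, type
    flight_idx = int(np.argmax(fca_model(state)))              # flight selection agent
    asa_input = np.concatenate([state, flight_features[flight_idx]])
    maneuver = int(np.argmax(asa_model(asa_input)))            # action selection agent
    instruction = build_instruction(callsigns[flight_idx], maneuver)
    push_voice(instruction)                                    # TTS + message middleware to the control system
    return instruction
```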
CN202111443511.7A 2021-11-30 2021-11-30 Aircraft interval management decision method based on deep reinforcement learning Active CN114141062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111443511.7A CN114141062B (en) 2021-11-30 2021-11-30 Aircraft interval management decision method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114141062A CN114141062A (en) 2022-03-04
CN114141062B true CN114141062B (en) 2022-11-01

Family

ID=80389977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111443511.7A Active CN114141062B (en) 2021-11-30 2021-11-30 Aircraft interval management decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114141062B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114664120B (en) * 2022-03-15 2023-03-24 南京航空航天大学 ADS-B-based aircraft autonomous interval control method
CN114819760B (en) * 2022-06-27 2022-09-30 中国电子科技集团公司第二十八研究所 Airport flight area surface risk intelligent decision-making system based on reinforcement learning
CN115240475B (en) * 2022-09-23 2022-12-13 四川大学 Aircraft approach planning method and device fusing flight data and radar image
CN115660446B (en) * 2022-12-13 2023-07-18 中国民用航空飞行学院 Intelligent generation method, device and system for air traffic control plan

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408975A (en) * 2014-10-28 2015-03-11 北京航空航天大学 Aircraft conflict extrication method and apparatus
CN106373435A (en) * 2016-10-14 2017-02-01 中国民用航空飞行学院 Non-centralized safety interval autonomous keeping system for pilot
CN109064019A (en) * 2018-08-01 2018-12-21 中国民航大学 A kind of system and method tested and assessed automatically for controller's simulated training effect
CN109118111A (en) * 2018-08-29 2019-01-01 南京航空航天大学 Trail interval limitation and the time slot allocation comprehensive strategic management decision support system that takes off
CN110084414A (en) * 2019-04-18 2019-08-02 成都蓉奥科技有限公司 A kind of blank pipe anti-collision method based on the study of K secondary control deeply
CN111047917A (en) * 2019-12-18 2020-04-21 四川大学 Flight landing scheduling method based on improved DQN algorithm
CN111882047A (en) * 2020-09-28 2020-11-03 四川大学 Rapid empty pipe anti-collision method based on reinforcement learning and linear programming
CN112396871A (en) * 2020-10-21 2021-02-23 南京莱斯信息技术股份有限公司 Approach delay allocation and absorption method based on track prediction
CN112818599A (en) * 2021-01-29 2021-05-18 四川大学 Air control method based on reinforcement learning and four-dimensional track

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8606491B2 (en) * 2011-02-22 2013-12-10 General Electric Company Methods and systems for managing air traffic
US9536435B1 (en) * 2015-07-13 2017-01-03 Double Black Aviation Technology L.L.C. System and method for optimizing an aircraft trajectory
US10810892B2 (en) * 2017-02-01 2020-10-20 Honeywell International Inc. Air traffic control flight management
EP3422130B8 (en) * 2017-06-29 2023-03-22 The Boeing Company Method and system for autonomously operating an aircraft
GB2569789A (en) * 2017-12-21 2019-07-03 Av8Or Ip Ltd Autonomous unmanned aerial vehicle and method of control thereof
US11238744B2 (en) * 2019-06-27 2022-02-01 Ge Aviation Systems Llc Method and system for controlling interval management of an aircraft
CN113611158A (en) * 2021-06-30 2021-11-05 四川大学 Aircraft trajectory prediction and altitude deployment method based on airspace situation
CN113593308A (en) * 2021-06-30 2021-11-02 四川大学 Intelligent approach method for civil aircraft

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Partheno-genetic algorithm for multi-runway flight landing scheduling; Wen Youmei; Software Guide; 2010-10-30 (No. 10); full text *
Air combat situation feature extraction based on deep networks; Li Gaolei et al.; Journal of System Simulation; 2017-12-08; full text *

Also Published As

Publication number Publication date
CN114141062A (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN114141062B (en) Aircraft interval management decision method based on deep reinforcement learning
CN111786713B (en) Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN106157700B (en) Air traffic control method based on the operation of 4D flight paths
Brittain et al. Autonomous aircraft sequencing and separation with hierarchical deep reinforcement learning
CN104504938B (en) The method of control of air traffic control system
US8798813B2 (en) Providing a description of aircraft intent
EP2801963A1 (en) Providing a description of aircraft intent
US8977411B2 (en) Providing a description of aircraft intent
EP2667275B1 (en) Method for providing a description of aircraft intent using a decomposition of flight intent into flight segments with optimal parameters
CN110059863B (en) Aircraft four-dimensional track optimization method based on required arrival time
JP2013173522A (en) Method for flying aircraft along flight path
Robinson, III et al. A fuzzy reasoning-based sequencing of arrival aircraft in the terminal area
CN113867354B (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
EP3598261B1 (en) Method and system for determining a descent profile
JP2020077387A (en) Optimization of vertical flight path
CN114373337B (en) Flight conflict autonomous releasing method under flight path uncertainty condition
Dhief et al. Speed control strategies for e-aman using holding detection-delay prediction model
Deniz et al. A Multi-Agent Reinforcement Learning Approach to Traffic Control at Merging Point of Urban Air Mobility
JP2851271B2 (en) Landing scheduling device
CN115660446A (en) Intelligent generation method, device and system for air traffic control plan
CN116415480B (en) IPSO-based off-road planning method for aircraft offshore platform
Brittain et al. Towards autonomous air traffic control for sequencing and separation - a deep reinforcement learning approach
Bianco et al. Coordination of Traffic Flows in the TMA
Spirkovska Vertiport Dynamic Density
Soares et al. Departure management with a reinforcement learning approach: Respecting CFMU slots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant