CN114781704A - Flight delay prediction method based on station-passing flight guarantee process

Info

Publication number: CN114781704A
Application number: CN202210368680.7A
Authority: CN
Inventors: 羊钊; 陈怡欣; 宋溢露; 曾维理; 包杰; 丛玮
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2022-07-22
Anticipated expiration: 2042-04-08
Also published as: CN114781704B

Abstract

The invention discloses a flight delay prediction method based on the station-passing flight guarantee process, comprising the following steps: collecting and cleaning the time data generated at each process node when flights pass through an airport during a specific time period; calculating the set of time differences between the actual take-off time and the planned departure time of all flights in the original data set, and obtaining a standard time period for each node difference; constructing the process nodes of each flight into a graph network structure in non-Euclidean space with logical relationships; calculating the distance between the standard time period of each node difference and the corresponding time difference of each delayed flight, loading it into the graph network structure as node features, and constructing and packaging a graph data set; building graph convolutional neural networks for information propagation and information updating; and obtaining an optimal flight delay time prediction model. By constructing a graph network structure with logical relationships, the method corrects incorrect time node data, predicts flight delay time, accounts for the correlation between the different links in which delay arises, and improves prediction accuracy.

Description

Flight delay prediction method based on station-passing flight guarantee flow

Technical Field

The invention belongs to the technical field of air traffic management, and particularly relates to a flight delay prediction method based on a station-passing flight guarantee process.

Background

With the rapid development of the civil aviation industry, air travel places increasing emphasis on the efficiency and quality of flight service. However, flight delays caused by the rapid growth in flight volume and the constraints on airspace resources have become increasingly serious and are now a factor degrading flight service quality.

At present, civil aviation flights operate continuously at a high level of load. When a flight passes through a station, the airport, air traffic control and the airline are all involved at the same time and interact in multiple ways, so the system runs at full load and the flight punctuality rate is difficult to improve. Although there is a great deal of research on flight delay, most current delay prediction methods treat the flight execution process as a whole and do not analyze in depth, along the preceding and subsequent steps of flight execution, the links in which delay arises and the correlation between them, so the delay prediction accuracy is not high. Based on operation process node data, the present method considers the node times and their preceding-subsequent correlation in the aircraft station-passing process when studying ground delay. It studies the flight delay prediction problem from a new perspective, helps improve flight delay prediction accuracy, and provides a scientific method for improving airport operating efficiency, speeding up the airlines' station-passing guarantee, and relieving frequent flight delays.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a flight delay prediction method based on the station-passing flight guarantee process, so as to solve the problems that the prior art does not consider the preceding-subsequent correlation of the links in which delay arises during airport station-passing and that flight delay prediction accuracy is not high. The method starts from the time data generated by the various operations of a flight in the ground guarantee process, considers the correlation between preceding and subsequent process nodes during station-passing, and converts the time data on the process nodes into graph network structure features in non-Euclidean space to predict flight delay.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

The invention provides a flight delay prediction method based on the station-passing flight guarantee process, comprising the following steps:

(1) collecting the time data generated at each process node when flights pass through the airport during a specific time period, cleaning the original time data of each flight process according to the logical relation of the flight operation process, and taking the cleaned data as the original data set T_p;

(2) computing, for the original data set T_p, the set D_diff of time differences between the actual take-off time and the planned departure time of all flights; according to the normal flight take-off time range [0, t_r], where t_r represents the maximum normal time of flight departure, dividing the flights into a delayed flight time set T_delay and a non-delayed flight time set T_non-delay; separately calculating, for the flights in T_delay and T_non-delay, the sets of time differences on each pair of subsequent-preceding process nodes, denoted Diff_delay and Diff_non-delay; computing statistics of the time difference set Diff_non-delay of the non-delayed flights, and selecting, for each subsequent-preceding process node time difference, the data segment between the lower quartile QL and the upper quartile QU as the standard time period set D_std;

(3) according to the preceding-subsequent connections of the flight process nodes, constructing the process nodes of each flight into a graph network structure G in non-Euclidean space with logical relationships;

(4) using the standard time period set D_std of the time differences of each pair of subsequent-preceding process nodes, calculating the set T_std of distances between D_std and the time difference set Diff_delay of the delayed flights on the subsequent-preceding process nodes; taking the distance set T_std as the feature input set X of the process nodes and loading it into the graph network structure G; and packaging the arranged edge set E', the index set I, the node feature set X, and the set D_diff-delay of time differences between the actual take-off time and the planned departure time of the delayed flights together into a graph data set;

(5) using the created graph network structure G, constructing four graph convolutional neural network models: a double-layer GCN, a double-layer GAT, a double-layer GraphSAGE, and a single-layer GCN combined with a single-layer GraphSAGE;

(6) selecting three machine learning models, dividing the data into a training set, a validation set and a test set, training the three machine learning models and the four constructed graph convolutional neural network models (double-layer GCN, double-layer GAT, double-layer GraphSAGE, and single-layer GCN combined with single-layer GraphSAGE), and obtaining the optimal model of each by adjusting parameters;

(7) comparing and evaluating the results of the optimal models of the three machine learning models and the four graph convolutional neural network models.

Further, the process nodes of the station-passing flight in step (1) include: a planned arrival node, an actual landing node, an actual chocks-on node, a cabin door opening node, a cabin door closing node, a boarding gate opening node, a boarding gate closing node, an actual chocks-off node, an actual taxi start node, an actual take-off node and a planned departure node.

Further, the flight time data in step (1) includes: planned arrival time, actual landing time, actual chocks-on time, cabin door opening time, cabin door closing time, boarding gate opening time, boarding gate closing time, actual chocks-off time, actual taxi start time, actual take-off time and planned departure time.

Further, the specific process of the step (1) is as follows:

(11) checking the chronological consistency of each collected flight time set, and eliminating abnormal values recorded in the time data (for example, a time point of a flight that is not on the same day as the remaining time points, or a later time point that is earlier than the time point preceding it);

(12) according to the length of the difference between the door opening time and the door closing time recorded in each flight time set, classifying the data whose difference exceeds 400, with the door opening time in the evening and the door closing time in the next morning, into the overnight flight set T_p-over, and classifying the data whose difference does not exceed 400, or which does not satisfy the condition of door opening in the evening and door closing in the next morning, into the non-overnight flight set T_p-nonover;

(13) taking the collected actual landing time t_al, planned arrival time t_ea, actual take-off time t_ad and planned departure time t_ed of each flight as the reference execution times of the flight, and cleaning the flight time set data of each sample; abnormal and missing times close to the actual landing time and the actual take-off time are processed with priority, and the time data to be processed are divided into two types: the first type of time is associated with the actual landing time t_al or the actual take-off time t_ad of the flight, and the second type of time is associated with two or more times before and after it in the flight; for the two types of abnormal or missing values, cleaning is performed separately within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, using the logical relation with the complete time data on the other process nodes;

(14) taking all the cleaned time data as the original data set T_p, wherein T_p comprises the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, and p represents the number of flights.
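The checks in steps (11) and (12) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the node names, the reduced node list, the reading of the threshold 400 as minutes (the text gives no unit), and the hour boundaries chosen for "evening" and "morning" are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical subset of process nodes, in their expected chronological order.
NODE_ORDER = ["actual_landing", "door_open", "door_close", "actual_takeoff"]

# The text states the threshold 400 without a unit; minutes are assumed here.
OVERNIGHT_GAP = 400


def is_chronological(record):
    """True when no later node time is earlier than the node time before it."""
    times = [record[name] for name in NODE_ORDER]
    return all(a <= b for a, b in zip(times, times[1:]))


def is_overnight(record, evening_hour=18, morning_hour=12):
    """Overnight: long door-open/door-close gap, opened in the evening,
    closed the next morning. Both hour boundaries are assumed values."""
    opened, closed = record["door_open"], record["door_close"]
    gap_minutes = (closed - opened).total_seconds() / 60
    next_day = (closed.date() - opened.date()).days == 1
    return (gap_minutes > OVERNIGHT_GAP
            and opened.hour >= evening_hour
            and next_day
            and closed.hour < morning_hour)


def split_flights(records):
    """Drop records that violate chronological order, then split the rest
    into the overnight set and the non-overnight set."""
    clean = [r for r in records if is_chronological(r)]
    over = [r for r in clean if is_overnight(r)]
    nonover = [r for r in clean if not is_overnight(r)]
    return over, nonover
```

Records dropped here would, in the method itself, be candidates for the recalculation described in step (13) rather than simply discarded.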

Further, the step (13) specifically includes:

(131) for the first type of abnormal value, if the actual taxi start time t_i-at of the i-th flight is missing or abnormal, the actual taxi start time of the i-th flight is recalculated by the following formula;

in the formula, t_i-at(cal) represents the recalculated actual taxi start time of the i-th flight, p-name represents the number of flights in the data set, t_i-ad represents the actual take-off time of the i-th flight, and p-nonover and p-over represent the numbers of non-overnight flights and overnight flights, respectively;

(132) for the second type of abnormal value, if the actual door closing time t_i-cd of the i-th flight is missing or abnormal, the actual door closing time of the i-th flight is recalculated by the following formula;

in the formula, t_i-cd(cal) represents the recalculated actual door closing time of the i-th flight, p-name represents the number of flights in the data set, t_i-boff represents the actual chocks-off time of the i-th flight, t_i-od represents the actual door opening time of the i-th flight, and p-nonover and p-over represent the numbers of non-overnight flights and overnight flights, respectively.

Further, the specific process of the step (2) is as follows:

(21) computing the set D_diff of time differences between the actual take-off time and the planned departure time of each flight in the original data set T_p, by the following formula:

d_i = t_i-ad − t_i-ed

in the formula, d_i represents the calculated time difference of the i-th flight, t_i-ad represents the actual take-off time of the i-th flight, and t_i-ed represents the planned departure time of the i-th flight;

(22) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover respectively, dividing the flights into two classes, the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, by the following formula;

in the formula, m represents the number of process nodes contained in one flight, p-name represents the number of flights in the data set, and p-nonover and p-over represent the numbers of non-overnight flights and overnight flights, respectively;

(23) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover respectively, calculating the time difference sets Diff_delay and Diff_non-delay on each pair of subsequent-preceding process nodes for the flights in the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, wherein a time difference on subsequent-preceding process nodes is the difference between the subsequent time and the preceding time on two adjacent process nodes, and comprises: actual landing time − planned arrival time; actual chocks-on time − actual landing time; door opening time − actual chocks-on time; gate closing time − door opening time; actual chocks-off time − door closing time; actual chocks-off time − gate closing time; actual taxi start time − actual chocks-off time; actual take-off time − actual taxi start time; and actual take-off time − planned departure time;

(24) for the overnight flight set T_p-over and the non-overnight flight set T_p-nonover respectively, computing the mean, upper quartile, lower quartile, maximum and minimum of the time difference set Diff_non-delay on each pair of subsequent-preceding process nodes of the non-delayed flight time set T_non-delay; drawing the distribution of Diff_non-delay with a box plot; and selecting the data segment between the lower quartile QL and the upper quartile QU of each time difference on each pair of subsequent-preceding process nodes as the standard time period set D_std.
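Steps (21) to (24) can be illustrated with a short sketch. It assumes that all times are already expressed as numbers in minutes, that a flight is delayed when its take-off difference exceeds t_r, and that every other flight (including early departures, which the text does not address) counts as non-delayed. The node names and the two node pairs are hypothetical, and the inclusive quartile method is one of several possible choices.

```python
from statistics import quantiles

# Hypothetical (subsequent, preceding) node pairs; a real run would list
# every pair named in step (23).
PAIRS = [("actual_landing", "planned_arrival"),
         ("actual_takeoff", "taxi_start")]


def takeoff_difference(flight):
    """d_i: actual take-off time minus planned departure time."""
    return flight["actual_takeoff"] - flight["planned_departure"]


def split_by_delay(flights, t_r):
    """Delayed when d_i > t_r; everything else is treated as non-delayed
    (an assumption for differences below zero)."""
    delayed = [f for f in flights if takeoff_difference(f) > t_r]
    non_delayed = [f for f in flights if takeoff_difference(f) <= t_r]
    return delayed, non_delayed


def node_differences(flight, pairs=PAIRS):
    """Subsequent time minus preceding time for each node pair."""
    return [flight[succ] - flight[pred] for succ, pred in pairs]


def standard_periods(non_delayed, pairs=PAIRS):
    """One (QL, QU) band per node pair, taken from the non-delayed flights."""
    columns = zip(*(node_differences(f, pairs) for f in non_delayed))
    bands = []
    for column in columns:
        ql, _, qu = quantiles(column, n=4, method="inclusive")
        bands.append((ql, qu))
    return bands
```

In the method the same computation would be run twice, once on the overnight set and once on the non-overnight set, giving two separate collections of bands.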

Further, the specific process of step (3) is as follows:

(31) creating a graph network structure G = (V, E) from the flight process node data used, where G denotes the created graph network structure; v_a ∈ V, where V = {v_1, v_2, v_3, ..., v_n} is a non-empty finite set of process nodes and v_1 denotes the 1st process node; (v_a, v_b) ∈ E, where E = {e_1, e_2, e_3, ..., e_{p-delay*(m-1)}} is a finite set of edges and e_1 denotes the 1st edge; and p-delay denotes the number of delayed flights;

(32) constructing, for the graph network structure G, the adjacency matrix and the degree matrix of the graph, by the following formula:

in the formula, A_ab represents the adjacency matrix of the constructed graph network structure, a and b represent process node numbers, v_a and v_b represent process node a and process node b respectively, R represents the real number field, N represents the number of all process nodes of the delayed flights, N = p-delay * m, p-delay represents the number of delayed flights, m represents the number of process nodes contained in one flight, and D_ab represents the degree matrix of the graph network structure.
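As an illustration of step (32), the sketch below builds an N × N adjacency matrix and degree matrix, with N = p-delay * m. The text fixes only the edge count, p-delay * (m − 1); the example assumes the simplest topology with that count, a chain of m nodes per flight, and assumes undirected edges. The actual guarantee process may branch, in which case only the edge list changes.

```python
def chain_edges(num_flights, m):
    """Edges of one m-node chain per flight, with globally numbered nodes.
    The chain topology is an assumption made for this example."""
    edges = []
    for flight in range(num_flights):
        offset = flight * m
        for k in range(m - 1):
            edges.append((offset + k, offset + k + 1))
    return edges


def adjacency_matrix(edges, n):
    """A[a][b] = 1 when nodes a and b share an edge (undirected), else 0."""
    A = [[0] * n for _ in range(n)]
    for a, b in edges:
        A[a][b] = 1
        A[b][a] = 1
    return A


def degree_matrix(A):
    """Diagonal matrix whose entry D[a][a] is the number of neighbors of a."""
    n = len(A)
    return [[sum(A[a]) if a == b else 0 for b in range(n)] for a in range(n)]
```

Because nodes of different flights are never connected, the adjacency matrix is block diagonal, one block per delayed flight.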

Further, the specific process of step (4) is as follows:

(41) taking out the edge sets E of all delayed flight samples, converting the edge sets E through the process node numbers, sequentially arranging the edge sets E according to the order of the edges and the size of the process node numbers, and storing the arranged edge sets E' into a data set 1;

(42) arranging flow nodes belonging to different graphs in sequence according to indexes, wherein the flow nodes in each graph have the same index value to obtain an index set I, and storing the index set I into a data set 2;

in the formula, m represents the number of flow nodes contained in one flight, and p-delay represents the number of flights delaying the flight;

(43) using the standard time period set D_std of the non-delayed flights obtained in step (24), comparing it by formula (8) with the time difference set Diff_delay on the subsequent-preceding process nodes of the delayed flight time set T_delay, in the overnight flight set T_p-over and the non-overnight flight set T_p-nonover respectively, to obtain the set T_std of distances between Diff_delay and the standard time periods; taking the distance set T_std as the feature input set X of the process nodes, arranging it as node attributes in flight order, and storing the arranged data into data set 3 in a two-dimensional data format;

in the formula, T_std represents the set of distances between the time difference set Diff_delay on the subsequent-preceding process nodes of the delayed flight time set T_delay and the standard time periods, T_is represents the distance of the s-th time difference of the i-th flight from the standard time period, QL_s represents the lower quartile of the standard time period of the s-th time difference, QU_s represents the upper quartile of the standard time period of the s-th time difference, Diff_delay-is represents the value of the s-th time difference of the i-th delayed flight, and p-delay represents the number of delayed flights;

(44) taking the set D_diff-delay of time differences between the actual take-off time and the planned departure time of all delayed flights in the sample as graph attributes, arranging it in flight order, and storing it as labels into data set 4 in a two-dimensional data format;

(45) packaging data set 1, data set 2, data set 3 and data set 4 together into a graph data set, so that the graph network structure taken out each time is a subset of the original graph network structure.
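Steps (41) to (45) can be sketched as below. Formula (8) is not reproduced in the text, so the distance used here is an assumed form: zero inside the band [QL_s, QU_s], and the signed overshoot outside it. The dictionary layout, the chain-shaped edge list and all function names are likewise hypothetical; the four keys stand in for data sets 1 to 4.

```python
def distance_to_band(value, ql, qu):
    """Assumed form of the distance T_is: 0 inside [ql, qu], positive above
    the band, negative below it. The sign convention is an assumption."""
    if value > qu:
        return value - qu
    if value < ql:
        return value - ql
    return 0


def build_graph_dataset(diff_delay, bands, labels, m):
    """Package edge list, per-node graph index, features and labels.

    diff_delay[i][s]: s-th time difference of the i-th delayed flight.
    bands[s]: (QL_s, QU_s) from the non-delayed flights.
    labels[i]: take-off time difference of the i-th delayed flight.
    m: number of process nodes per flight.
    """
    num_flights = len(diff_delay)
    features = [[distance_to_band(v, ql, qu) for v, (ql, qu) in zip(row, bands)]
                for row in diff_delay]
    # All nodes of the same flight graph share one index value.
    index = [i for i in range(num_flights) for _ in range(m)]
    # Chain-shaped edges per flight, assumed for illustration only.
    edges = [(i * m + k, i * m + k + 1)
             for i in range(num_flights) for k in range(m - 1)]
    return {"edges": edges, "index": index, "x": features, "y": list(labels)}
```

Keeping a per-node graph index is what allows a batch of several flights to be stored as one large disconnected graph and split back into individual flight graphs later.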

Further, the specific process of the step (5) is as follows:

(51) for the created graph network structure G, propagating the features on the graph to the next layer by a propagation rule f, by the following formula;

in the formula, H^(l) represents the features of the l-th layer, A represents the graph network structure description of the process nodes, here expressed by the adjacency matrix A_ab, Z represents the output, and X^(0) represents the feature input set to be fed into the model;

(52) applying the layer-wise propagation to the concrete data, expressed as follows:

X^(l+1) = f(X^(l), A) (10)

wherein f represents the propagation rule, and X^(l) represents the feature input set of the process nodes at the l-th layer;

(53) in the constructed GCN layer, the propagation rule f takes the following form:

in the formula, X^(l+1) represents the feature output set of the process nodes at the l-th layer, D represents the degree matrix of the graph network structure input to the l-th layer, I_n represents the self-loops of the graph network structure, W^(l) represents the convolution kernel of the l-th layer, i.e. a learnable weight, and σ represents a nonlinear transformation;

(54) constructing a GAT layer, and realizing that different weights are distributed to different edges through attention coefficients;

(55) constructing a GraphSAGE layer, converting a full graph training mode of the process node characteristics into a small batch training mode taking the process node characteristics as the center through neighbor sampling, and performing characteristic aggregation on information of neighbor process nodes by adopting an aggregation function;

(56) for the different graph convolution layers, taking the times of the station-passing flight process as feature input and treating the prediction as a graph-level task, constructing two-layer graph convolutional neural networks, namely: a double-layer GCN, a double-layer GAT, a double-layer GraphSAGE, and a single-layer GCN connected to a single-layer GraphSAGE; each two-layer graph neural network is followed by a two-layer fully connected neural network and a final pooling layer, converting the node-level and edge-level tasks into a global graph-level task and forming four different graph convolutional neural network models.
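To make the GCN layer of step (53) concrete, the sketch below implements the widely used propagation rule X^(l+1) = σ(D̃^(-1/2) (A + I_n) D̃^(-1/2) X^(l) W^(l)), where D̃ is the degree matrix of A + I_n. This rule matches the symbols defined in step (53), but the formula image is not reproduced in the text, so its exact agreement with the patent's formula is an assumption. ReLU is chosen for σ purely for illustration, and plain Python lists replace a tensor library to keep the example self-contained.

```python
import math


def matmul(P, Q):
    """Plain list-of-lists matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]


def gcn_layer(A, X, W):
    """One GCN propagation step with self-loops and symmetric normalization.

    A: n x n adjacency matrix, X: n x f node features, W: f x g weights.
    """
    n = len(A)
    # A + I_n: add a self-loop to every node.
    A_hat = [[A[a][b] + (1 if a == b else 0) for b in range(n)]
             for a in range(n)]
    # D^(-1/2), taken from the degrees of the self-looped graph.
    d_inv_sqrt = [1 / math.sqrt(sum(row)) for row in A_hat]
    A_norm = [[d_inv_sqrt[a] * A_hat[a][b] * d_inv_sqrt[b] for b in range(n)]
              for a in range(n)]
    H = matmul(matmul(A_norm, X), W)
    # ReLU stands in for the nonlinear transformation sigma.
    return [[max(0.0, h) for h in row] for row in H]
```

For two connected nodes with features 1 and 3 and a unit weight, each node's output is the average of the two features, which shows how one layer mixes a node's own time feature with those of its neighboring process nodes.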

Further, the step (54) specifically includes:

(541) during propagation in the GAT layer, since each process node is connected to different neighbor process nodes, calculating the correlation coefficient e_ab from process node v_a to process node v_b, and then calculating the attention coefficient of each edge, by the following formula;

in the formula, LeakyReLU represents an activation function, α_ab represents the attention coefficient from process node v_a to process node v_b, and k represents a neighbor process node of process node a, taken from the neighbor process node set of process node a;

(542) weighting and summing the features according to the calculated attention coefficients, by the following formula;

in the formula, x'_a represents the new feature of each process node after fusing neighborhood information, W represents a learnable weight, and x_b represents the feature of process node v_b;

(543) generating a new process node feature set X' from the new feature x'_a of each process node, and forming the GAT layer, which propagates features using the information propagation scheme of the GCN layer.
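Steps (541) and (542) can be illustrated with the standard single-head graph attention computation: e_ab = LeakyReLU(aᵀ[W x_a ‖ W x_b]), α_ab = softmax of e_ab over the neighbors of a, and x'_a = Σ_b α_ab W x_b. The patent's formula images are not reproduced in the text, so treating them as this standard form is an assumption, as are the attention vector name `att`, the LeakyReLU slope of 0.2, and the omission of an output nonlinearity.

```python
import math


def leaky_relu(z, slope=0.2):
    """LeakyReLU activation; the slope value is an assumed default."""
    return z if z > 0 else slope * z


def linear(W, x):
    """Apply the learnable weight matrix W to one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gat_node_update(a, neighbors, X, W, att):
    """Attention coefficients and new feature of node a (single head).

    X: list of node feature vectors, W: weight matrix,
    att: attention vector applied to the concatenation [W x_a || W x_b].
    """
    h = {b: linear(W, X[b]) for b in set(neighbors) | {a}}
    scores = {b: leaky_relu(sum(w * v for w, v in zip(att, h[a] + h[b])))
              for b in neighbors}
    # Softmax over the neighbors, shifted by the maximum for stability.
    top = max(scores.values())
    exps = {b: math.exp(s - top) for b, s in scores.items()}
    total = sum(exps.values())
    alpha = {b: e / total for b, e in exps.items()}
    dim = len(h[a])
    new_x = [sum(alpha[b] * h[b][d] for b in neighbors) for d in range(dim)]
    return alpha, new_x
```

Whether a node is included in its own neighbor set (a self-loop) is left to the caller; the text does not state it either way.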

Further, the step (55) specifically includes:

(551) after aggregating the features of the neighbor process nodes, the GraphSAGE layer combines the aggregated neighbor features with the features of the process node itself; the aggregation is expressed as follows;

wherein k represents the number of aggregation iterations, W^(k) represents the weight to be learned at the k-th aggregation, σ represents the nonlinear transformation, the feature term for the node itself is the feature of the a-th process node after the (k−1)-th aggregation, AGGREGATE_k denotes the k-th aggregation function, γ(v_a) represents the set of neighbor process nodes of the a-th process node, and CONCAT represents the function that concatenates the feature of the a-th process node with the features of its neighbor process nodes;

(552) performing L2 normalization on the aggregated features of each process node, by the following formula;

wherein V represents the set of process nodes in the graph network structure, and the normalized quantity is the feature vector of the a-th process node after k aggregations.
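A single GraphSAGE update as described in steps (551) and (552) might look like the sketch below. It assumes a mean aggregator for AGGREGATE_k (the text does not name the aggregation function) and ReLU for σ, and it omits the neighbor sampling of step (55) by taking the neighbor list as given.

```python
import math


def sage_node_update(a, neighbors, H, W):
    """One GraphSAGE aggregation for node a.

    H: list of node feature vectors after the previous aggregation.
    W: weight matrix applied to CONCAT(own feature, aggregated neighbors).
    """
    dim = len(H[a])
    # Mean aggregator over the neighbor features (an assumed choice).
    agg = [sum(H[b][d] for b in neighbors) / len(neighbors)
           for d in range(dim)]
    # CONCAT: the node's own feature followed by the aggregated feature.
    concat = H[a] + agg
    # Linear map followed by ReLU as the nonlinear transformation.
    z = [max(0.0, sum(w * v for w, v in zip(row, concat))) for row in W]
    # L2 normalization of the updated feature vector.
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z] if norm > 0 else z
```

Concatenating rather than averaging the node's own feature with its neighbors' keeps the node's own process time distinguishable from the neighborhood summary.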

Further, the specific process of the step (6) is as follows:

(61) the three selected machine learning models are a decision tree model, a random forest model and an XGBoost model;

(62) dividing the data into a training set X_train, a validation set X_val and a test set X_test;

(63) normalizing the data and inputting the normalized data into each model, averaging the model results over the runs by K-fold cross validation, and training and adjusting parameters to obtain the optimal model of each of the three machine learning models;

(64) inputting the training set data into each of the four graph convolutional neural network models (double-layer GCN, double-layer GAT, double-layer GraphSAGE, and single-layer GCN combined with single-layer GraphSAGE) for training, using the mean absolute error (MAE) as the error for back propagation to update the weights, and performing model training and parameter adjustment multiple times to obtain the optimal model of each of the four graph convolutional neural network models.
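The K-fold cross validation mentioned in step (63) can be sketched as follows. The interleaved fold assignment and the function names are illustrative choices, and `evaluate` is a hypothetical callback standing in for training one model on the training indices and scoring it on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds.
    Samples are assigned to folds in an interleaved fashion."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, validation


def cross_validated_score(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, validation) over folds.
    `evaluate` is a hypothetical user-supplied function."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Averaging over the folds gives a score that depends less on any single split, which is the stated purpose of averaging the model results of each run.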

Further, the specific process of step (7) is as follows:

(71) selecting three indexes, the mean absolute error (MAE), the root mean square error (RMSE) and the mean absolute percentage error (MAPE), to measure the distance between the predicted value and the true value of each model;

(72) inputting the test set into the obtained optimal models of the three machine learning models and the four graph convolutional neural network models, and comparing the performance of the three machine learning models and the four graph convolutional neural network models using the indexes of step (71).
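The three indexes of step (71) follow their usual definitions, sketched below. MAPE is returned in percent and assumes that no true value is zero, which holds for delayed flights whose take-off difference exceeds t_r > 0.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

MAE and RMSE are in the same unit as the delay itself, while MAPE is relative, so the three together show both the absolute size of the error and its size in proportion to the delay.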

The invention has the beneficial effects that:

The method focuses on the multi-step process of ground guarantee for station-passing flights. Starting from the correlation between the processes and the differing lengths of their operation times, it constructs a graph network structure with logical relationships, extracts the operation process times of standard flights from the time sets of all process nodes, and processes the operation process times of delayed flights to obtain the node features on the graph network. The flight operation process is thereby combined with flight delay prediction, the data set for flight delay prediction is enriched, and the accuracy of flight delay time prediction is improved.

The method considers the operation process data of flights under ground guarantee and the logical relationships among the data, and has practical application value for understanding the mechanism by which ground departure delays arise and for predicting their duration.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2a is a box plot of the time differences of non-overnight flights among the non-delayed flights, according to an embodiment of the invention.

FIG. 2b is a box plot of the time differences of overnight flights among the non-delayed flights, according to an embodiment of the invention.

FIG. 3 is a passenger-related process flow chart according to an embodiment of the invention.

FIG. 4 is a diagram illustrating a neural network according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of K-fold cross validation according to an embodiment of the present invention.

Detailed Description

In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.

Referring to FIG. 1, the flight delay prediction method based on the station-passing flight guarantee process of the present invention includes the following steps:

(1) collecting the time data generated at each process node when flights pass through the airport during a specific time period, cleaning the original time data of each flight process according to the logical relation of the flight operation process, and taking the cleaned data as the original data set T_p;

Wherein the process nodes of the station-passing flight in step (1) include: a planned arrival node, an actual landing node, an actual chocks-on node, a cabin door opening node, a cabin door closing node, a boarding gate opening node, a boarding gate closing node, an actual chocks-off node, an actual taxi start node, an actual take-off node and a planned departure node.

The flight time data in step (1) includes: planned arrival time, actual landing time, actual chocks-on time, cabin door opening time, cabin door closing time, boarding gate opening time, boarding gate closing time, actual chocks-off time, actual taxi start time, actual take-off time and planned departure time.

In the example, the time data of all station-passing flight guarantee process nodes at Pudong International Airport from June 1 to December 31, 2019 are collected; the collected data comprise time sets covering 11 process nodes, such as the planned take-off time at the previous station, the planned arrival time, the actual take-off time, the actual landing time and the actual chocks-on time;

Because the times in the collected sets are logically ordered, the data must be analyzed and preprocessed appropriately; the processing principles are as follows:

(12) because the example data contain overnight departures, according to the length of the difference between the door opening time and the door closing time recorded in each flight time set, the data whose difference exceeds 400, with the door opening time in the evening and the door closing time in the next morning, are classified into the overnight flight set T_p-over, and the data whose difference does not exceed 400, or which do not satisfy the condition of door opening in the evening and door closing in the next morning, are classified into the non-overnight flight set T_p-nonover;

(13) taking the collected actual landing time t_al, planned arrival time t_ea, actual take-off time t_ad and planned departure time t_ed of each flight as the reference execution times of the flight, and cleaning the flight time set data of each sample; abnormal and missing times close to the actual landing time and the actual take-off time are processed with priority, and the time data to be processed are divided into two types: the first type of time is associated with the actual landing time t_al or the actual take-off time t_ad of the flight, and the second type of time is associated with two or more times before and after it in the flight; for the two types of abnormal or missing values, cleaning is performed separately within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, using the logical relation with the complete time data on the other process nodes; each of the two types is illustrated below with an example node;

(131) the first type of abnormal value is illustrated with the actual taxi start time: if the actual taxi start time t_i-at of the i-th flight is missing or abnormal, the actual taxi start time of the i-th flight is recalculated by the following formula;

(132) the second type of abnormal value is illustrated with the actual door closing time: if the actual door closing time t_i-cd of the i-th flight is missing or abnormal, the actual door closing time of the i-th flight is recalculated by the following formula;

in the formula, t_i-cd(cal) represents the recalculated actual door closing time of the i-th flight, p-name represents the number of flights in the data set, t_i-boff represents the actual chocks-off time of the i-th flight, t_i-od represents the actual door opening time of the i-th flight, and p-nonover and p-over represent the numbers of non-overnight flights and overnight flights, respectively.

(2) computing, for the original data set T_p, the set D_diff of time differences between the actual take-off time and the planned departure time of all flights; according to the normal flight take-off time range [0, t_r], where t_r represents the maximum normal time of flight departure, dividing the flights into a delayed flight time set T_delay and a non-delayed flight time set T_non-delay; separately calculating, for the flights in T_delay and T_non-delay, the sets of time differences on each pair of subsequent-preceding process nodes, denoted Diff_delay and Diff_non-delay; computing the mean, upper quartile, lower quartile, maximum and minimum of the time difference set Diff_non-delay on the subsequent-preceding process nodes of the non-delayed flights, and selecting the data segment of each subsequent-preceding process node time difference as the standard time period set D_std;

The specific process of the step (2) is as follows:

(21) computing a raw data set T_pTime difference value set D between actual take-off time and planned departure time of each flight_diffThe formula is as follows:

in the formula, d_iRepresenting the calculated differential time, t, for the ith flight_i-adIndicating the actual departure time of the ith flight,t_i-edrepresenting the planned departure time of the ith flight;

(22) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, divide the flights into the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, respectively; the formula is as follows;
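The computation of d_i in step (21) and the division in step (22) can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of the invention: the dictionary keys `t_ad` and `t_ed`, the function names, the unit of minutes and the value t_r = 15 are all hypothetical, and flights with d_i < 0 are left unclassified because the text only defines the normal range [0, t_r].

```python
from datetime import datetime


def departure_differences(flights):
    """Return d_i = actual takeoff time - planned departure time, in minutes."""
    return [(f["t_ad"] - f["t_ed"]).total_seconds() / 60.0 for f in flights]


def split_by_delay(diffs, t_r):
    """Split flight indices by the normal departure range [0, t_r] (minutes).

    Assumption: d_i > t_r is delayed, 0 <= d_i <= t_r is non-delayed.
    """
    delayed = [i for i, d in enumerate(diffs) if d > t_r]
    non_delayed = [i for i, d in enumerate(diffs) if 0 <= d <= t_r]
    return delayed, non_delayed


# Hypothetical sample: two flights with planned and actual departure times.
flights = [
    {"t_ed": datetime(2022, 1, 1, 8, 0), "t_ad": datetime(2022, 1, 1, 8, 10)},
    {"t_ed": datetime(2022, 1, 1, 9, 0), "t_ad": datetime(2022, 1, 1, 9, 50)},
]
d_diff = departure_differences(flights)
t_delay, t_non_delay = split_by_delay(d_diff, t_r=15)  # t_r = 15 is illustrative
```

In practice the same split would be applied separately to the overnight and non-overnight flight sets.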

(23) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively compute the time difference sets Diff_delay and Diff_non-delay on each subsequent-preceding flow node for the flights in the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, wherein a time difference on subsequent-preceding flow nodes is the difference between the later time and the earlier time on two adjacent flow nodes, and comprises: actual landing time - planned arrival time; actual gear shifting time - actual landing time; door opening time - gear shifting time; boarding gate closing time - boarding gate opening time; door closing time - cabin door opening time; actual gear-off time - door closing time; actual gear-off time - boarding gate closing time; actual taxi-starting time - actual gear-off time; actual takeoff time - actual taxi-starting time; and actual takeoff time - planned departure time;

(24) for the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively compute the mean, upper quartile, lower quartile, maximum and minimum of the time difference set Diff_non-delay on each subsequent-preceding flow node of the non-delayed flight time set T_non-delay; draw the distribution of each time difference in Diff_non-delay using box plots, as shown in Figs. 2a-2b; and select the data segment between the lower quartile QL and the upper quartile QU of each time difference on each subsequent-preceding flow node as the standard time period set D_std; the selected node differences are shown in Table 1:

TABLE 1
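The quartile selection of step (24) can be illustrated with a short sketch. The sample values and the function name `standard_period` are hypothetical, and the inclusive interpolation method is an assumption, since the text does not state how the quartiles are interpolated.

```python
import statistics


def standard_period(diffs):
    """Return (QL, QU) of one subsequent-preceding time difference."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    ql, _median, qu = statistics.quantiles(diffs, n=4, method="inclusive")
    return ql, qu


# Hypothetical "door closing time - cabin door opening time" values (minutes)
# observed on non-delayed flights; 90 acts as an outlier outside [QL, QU].
door_close_minus_door_open = [30, 32, 35, 38, 40, 45, 90]
ql, qu = standard_period(door_close_minus_door_open)
```

One such (QL, QU) pair would be computed per time difference, separately for overnight and non-overnight flights, to form D_std.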

(3) According to the sequential connections between the flow nodes of each flight, construct the flow nodes of the flight into a non-Euclidean graph network structure G with logical relationships, as shown in Fig. 3;

the specific process of the step (3) is as follows:

(31) create a graph network structure G = (V, E) for the flight flow node data used, where G represents the created graph network structure; v_a ∈ V, where V = {v₁, v₂, v₃, ..., v_n} is a non-empty set of a finite number of flow nodes and v₁ represents the 1st flow node; (v_a, v_b) ∈ E, where E = {e₁, e₂, e₃, ..., e_{p-delay*(m-1)}} is a set of a finite number of edges, e₁ represents the 1st edge, and p-delay represents the number of delayed flights;

in the formula, A_ab represents the adjacency matrix of the constructed graph network structure, a and b represent flow node numbers, v_a and v_b represent flow nodes a and b respectively, R represents the real number field, N represents the number of all flow nodes of all delayed flights, N = p-delay*m, p-delay represents the number of delayed flights, m represents the number of flow nodes contained in one flight, and D_ab represents the degree matrix of the graph network structure.
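As an illustration of the sizes involved, the sketch below builds an adjacency matrix and a degree matrix for p-delay flights with m flow nodes each. It assumes, for simplicity, that the flow nodes of one flight form an undirected chain (each node linked to the next), which yields the stated p-delay*(m-1) edges; the actual topology of Fig. 3 may branch. The function name `chain_graph` is hypothetical.

```python
import numpy as np


def chain_graph(p_delay, m):
    """Adjacency and degree matrices for p_delay flights of m chained nodes."""
    n = p_delay * m                      # N = p-delay * m nodes in total
    adj = np.zeros((n, n))
    for flight in range(p_delay):
        offset = flight * m              # first node index of this flight
        for k in range(m - 1):
            a, b = offset + k, offset + k + 1
            adj[a, b] = adj[b, a] = 1.0  # undirected edge (assumption)
    deg = np.diag(adj.sum(axis=1))       # degree matrix: row sums on the diagonal
    return adj, deg


adj, deg = chain_graph(p_delay=2, m=4)
```

No edge connects nodes of different flights, so the adjacency matrix is block diagonal with one block per delayed flight.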

(4) Using the standard time period set D_std of the time differences on each subsequent-preceding flow node, compute the set T_std of distances between D_std and the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flights; take the distance set T_std as the feature input set X of the flow nodes and load X into the graph network structure G as the flow node features; package the arranged edge set E', the index set I, the node feature set X and the time difference set D_diff of actual takeoff time and planned departure time together into a graph data set;

the specific process of the step (4) is as follows:

(42) sequentially arranging flow nodes belonging to different graphs according to indexes, wherein the flow nodes in each graph have the same index value to obtain an index set I, and storing the index set I into a data set 2;
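The index set I of step (42) can be pictured as one integer per flow node that identifies the graph (flight) the node belongs to. The sketch below is illustrative only; the function name and the zero-based indices are assumptions.

```python
def index_set(p_delay, m):
    """One graph index per flow node: all m nodes of flight i carry index i."""
    return [flight for flight in range(p_delay) for _ in range(m)]


batch_index = index_set(p_delay=3, m=2)
```

With such an index, the nodes of each flight can later be pooled into a single graph-level prediction.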

(43) compare the standard time period set D_std of the non-delayed flights obtained in step (24) with the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flight time set T_delay in the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively, according to formula (8), to obtain the set T_std of distances between the time difference set Diff_delay of the delayed flight time set T_delay and the standard time period set D_std; take the distance set T_std as the feature input set X of the flow nodes, arrange X in flight order as the node attribute, and store the arranged data into data set 3 in a two-dimensional data format;

in the formula, T_std represents the set of distances between the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flight time set T_delay and the standard time period set D_std, T_is represents the distance of the s-th time difference of the ith flight from the standard time period, QL_s represents the lower quartile of the standard time period of the s-th time difference, QU_s represents the upper quartile of the standard time period of the s-th time difference, Diff_delay-is represents the value of the s-th time difference of the ith delayed flight, and p-delay represents the number of delayed flights;
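One plausible reading of the distance T_is is the distance from Diff_delay-is to the interval [QL_s, QU_s]. The sketch below is an assumption rather than a transcription of formula (8): it returns zero inside the standard time period and a signed excess outside it, and the sign convention and the function name are hypothetical.

```python
def distance_to_standard_period(diff, ql, qu):
    """Signed distance from a time difference to the interval [ql, qu]."""
    if diff > qu:
        return diff - qu   # longer than the standard time period (positive)
    if diff < ql:
        return diff - ql   # shorter than the standard time period (negative)
    return 0.0             # inside the standard time period


# Hypothetical feature row for one delayed flight with three time differences.
standard = [(33.5, 42.5), (5.0, 9.0), (10.0, 20.0)]   # (QL_s, QU_s) per s
diffs = [50.0, 4.0, 15.0]                             # Diff_delay-is per s
features = [distance_to_standard_period(d, ql, qu)
            for d, (ql, qu) in zip(diffs, standard)]
```

Each delayed flight then contributes one such feature row, which serves as the flow node features X.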

(44) take the set D_diff-delay of time differences between the actual takeoff time and the planned departure time of all delayed flights in the sample as the graph attribute, arrange it in flight order, and store D_diff-delay as the label into data set 4 in a two-dimensional data format;

the specific process of the step (5) is as follows:

(51) for the created graph network structure G, propagate the features on the graph to the next layer using a propagation rule f; the formula is as follows;

in the formula, H^(l) represents the features of the l-th layer, A represents the graph network structure description of the flow nodes, here represented by the adjacency matrix A_ab, Z represents the output, and X^(0) represents the feature input set to be fed into the model;

(52) apply the layer-wise propagation to the specific data; the formula is expressed as follows:

X^(l+1)＝f(X^(l),A) (10)

in the formula, X^(l+1) represents the feature output set of the flow nodes of the l-th layer, D represents the degree matrix of the graph network structure input to the l-th layer, I_n represents the self-loops of the graph network structure, W^(l) represents the convolution kernel of the l-th layer, i.e. the learnable weight, and σ represents the nonlinear transformation;
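The symbols defined above (D, I_n, W^(l), σ) match the widely used graph convolution rule X^(l+1) = σ(D^(-1/2) (A + I_n) D^(-1/2) X^(l) W^(l)). The sketch below implements that standard rule as an assumption; the text does not reproduce the exact expression, and taking D as the degree matrix of A + I_n and σ as ReLU are illustrative choices.

```python
import numpy as np


def gcn_layer(x, adj, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_tilde = adj + np.eye(adj.shape[0])              # add self-loops I_n
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))   # D^-1/2 as a vector
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)       # sigma = ReLU (assumed)


# Three chained flow nodes with one feature each, mapped to two channels.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0]])
w = np.array([[1.0, -1.0]])
h = gcn_layer(x, adj, w)
```

Stacking two such layers gives each flow node access to information from nodes up to two steps away in the flight flow.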

(56) for the different graph convolutional layers, taking the station-passing flight flow times as the feature input and considering the graph-oriented task to be predicted, construct double-layer graph convolutional neural networks, comprising: a double-layer GCN, a double-layer GAT, a double-layer GraphSAGE, and a single-layer GCN connected to a single-layer GraphSAGE; each double-layer graph neural network is connected to a double-layer fully connected neural network and a final pooling layer, so that the node-oriented and edge-oriented tasks are converted into a global graph-oriented task, forming four different graph convolutional neural network models, as shown in Fig. 4.

Further, the step (54) specifically includes:

a neighbor process node set representing a process node a;

in the formula, x'_a represents the new feature of each flow node after fusing the neighborhood information, W represents a learnable weight, and x_b represents the feature of flow node v_b;

(543) from the new feature x'_a generated for each flow node, generate the new flow node feature set X'; the GAT layer thus formed propagates features in the same information transmission manner as the GCN layer.
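The attention-weighted aggregation of step (54) can be sketched with the standard single-head graph attention formulation: e_ab = LeakyReLU(att^T [W x_a || W x_b]), followed by a softmax over the neighbors of a. This is an assumption, since the formulas are not reproduced in the text; the attention vector `att`, the LeakyReLU slope of 0.2 and the inclusion of a self-loop are illustrative choices.

```python
import numpy as np


def gat_layer(x, adj, weight, att):
    """Single-head attention aggregation over each node's neighbours."""
    h = x @ weight                                    # W x_b for every node
    n = x.shape[0]
    mask = (adj + np.eye(n)) > 0                      # neighbours plus self-loop
    out = np.zeros_like(h)
    for a in range(n):
        nbrs = np.flatnonzero(mask[a])
        pairs = np.concatenate([np.tile(h[a], (len(nbrs), 1)), h[nbrs]], axis=1)
        e = pairs @ att                               # correlation coefficients e_ab
        e = np.where(e > 0, e, 0.2 * e)               # LeakyReLU (slope assumed)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                          # softmax: attention coefficients
        out[a] = alpha @ h[nbrs]                      # x'_a, no output nonlinearity
    return out


adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0]])
# With a zero attention vector every neighbour receives the same weight.
out = gat_layer(x, adj, weight=np.array([[1.0]]), att=np.zeros(2))
```

With trained, non-zero attention parameters, edges between flow nodes that matter more for the delay receive larger weights.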

Further, the step (55) specifically includes:

wherein V represents a set of flow nodes in a graph network structure,

the specific process of the step (6) is as follows:

(62) the number of usable samples finally obtained after data cleaning is 18794; for the comparison models, the first 16500 samples are used as the training sample set X_train and the validation sample set X_val, trained using the K-fold cross-validation method as shown in Fig. 5, and samples 16501 to 18794 are used as the test sample set X_test; for the four graph convolutional neural network models (double-layer GCN, double-layer GAT, double-layer GraphSAGE, and single-layer GCN combined with single-layer GraphSAGE), the first 15000 samples are used as the training sample set, samples 15001 to 16500 as the validation sample set, and samples 16501 to 18794 as the test sample set;
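The sequential split used for the graph models can be written down directly from the sample counts above. The function name is hypothetical and the indices are zero-based, so index 16500 corresponds to sample 16501.

```python
def split_indices(n_total=18794, n_train=15000, n_val=1500):
    """Sequential train / validation / test split for the graph models."""
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_total)
    return train, val, test


train_idx, val_idx, test_idx = split_indices()
```

The test portion is the same range of samples for the comparison models and the graph models, which keeps the evaluation comparable.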

(63) normalize the samples of the training set to speed up model training and improve the prediction accuracy of the model; the formula is as follows;

x_std_ij = (x_ij - x_min-j) / (x_max-j - x_min-j)

in the formula, x_std_ij is the normalized value of the jth feature of the ith flight, x_ij represents the jth feature of the ith flight before normalization, x_min-j represents the minimum value of the jth feature, and x_max-j represents the maximum value of the jth feature;
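Given the definitions of x_min-j and x_max-j, the normalization of step (63) reads as min-max scaling of each feature. A minimal sketch follows; the function name is hypothetical, and mapping a constant feature to 0.0 is an added safeguard not described in the text.

```python
def min_max_normalize(samples):
    """Scale every feature (column) of `samples` to the range [0, 1]."""
    n_features = len(samples[0])
    mins = [min(row[j] for row in samples) for j in range(n_features)]
    maxs = [max(row[j] for row in samples) for j in range(n_features)]
    return [
        [
            # Guard against division by zero for constant features (assumption).
            (row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_features)
        ]
        for row in samples
    ]


# Three hypothetical flights with two features each.
x_std = min_max_normalize([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
```

To avoid leaking information, the minimum and maximum would normally be taken from the training set and reused for the validation and test sets.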

for the three machine learning models (decision tree, random forest and XGBoost), K is set to 5 in the K-fold cross-validation method, the model results of each run are averaged and the parameters are tuned; the tuned optimal parameter combinations are given in Tables 2, 3 and 4;

TABLE 2

TABLE 3

TABLE 4

(64) Input the training set data into each of the four graph convolutional neural network models (double-layer GCN, double-layer GAT, double-layer GraphSAGE, and single-layer GCN combined with single-layer GraphSAGE) for training, use the mean absolute error (MAE) as the back-propagated error to update the weights, and perform model training and parameter tuning multiple times to obtain the optimal model of each of the four graph convolutional neural network models, as shown in Table 5;

TABLE 5

(7) Comparing and evaluating model results by using the obtained three machine learning models and the optimal model of the four graph convolution neural network models;

the specific process of the step (7) is as follows:

(71) selecting three indexes of Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) to measure the distance between the predicted value and the true value of each model; the formulas are respectively as follows:

wherein p-delay represents the number of delayed flights, y_i' represents the label value of the ith flight predicted by each model, and y_i represents the true label value of the ith flight;
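The three indexes of step (71) have standard definitions, sketched below for a list of true labels y_i and predicted labels y_i'. Expressing MAPE in percent is an assumption about the convention used, and the sample values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; requires non-zero y_true."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical delay times in minutes for three delayed flights.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 44.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large individual errors more strongly than MAE, while MAPE relates each error to the size of the true delay.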

(72) input the test set into the obtained optimal models of the three machine learning models and the four graph convolutional neural network models, and compare their performance using the indexes of step (71). The prediction results of the seven models are shown in Table 6:

TABLE 6

It can be seen from Table 6 that conventional methods such as random forest and decision tree have difficulty adapting to time node data, and the flight delay times they predict on the verification set show larger errors relative to the true values. The method of the invention, by contrast, takes into account the sequential logical relationships among the operation flows in the station-passing process, converts these relationships into a graph structure, and loads the node features into it. Its predicted flight delay times are more accurate than those of the conventional methods, so the method is better suited to flight delay prediction.

While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A flight delay prediction method based on a station-passing flight guarantee process is characterized by comprising the following steps:

(2) Compute the set D_diff of time differences between the actual takeoff time and the planned departure time of all flights in the raw data set T_p; according to the normal flight departure time range [0, t_r], where t_r represents the maximum normal departure time, divide the flights into a delayed flight time set T_delay and a non-delayed flight time set T_non-delay; compute the time difference sets on each subsequent-preceding flow node for the flights in T_delay and T_non-delay, denoted Diff_delay and Diff_non-delay respectively; compute the mean, upper quartile, lower quartile, maximum and minimum of the time difference set Diff_non-delay on each subsequent-preceding flow node of the non-delayed flights, and select the data segment of the time difference on each subsequent-preceding flow node as the standard time period set D_std;

(3) According to the sequential connections between the flow nodes of each flight, construct the flow nodes of the flight into a non-Euclidean graph network structure G with logical relationships;

(4) Using the standard time period set D_std of the time differences on each subsequent-preceding flow node, compute the set T_std of distances between D_std and the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flights; take the distance set T_std as the feature input set X of the flow nodes and load X into the graph network structure G as the flow node features; package the arranged edge set E', the index set I, the node feature set X and the set D_diff-delay of time differences between the actual takeoff time and the planned departure time of the delayed flights together into a graph data set;

2. The flight delay prediction method based on the station-passing flight guarantee process according to claim 1, wherein the specific process of the step (1) is as follows:

(11) checking the time context correlation of each collected flight time set, and eliminating abnormal values recorded in the time data;

(12) according to the length of the difference between the door opening time and the door closing time recorded in each flight time set, classify data whose difference exceeds 400, whose door opening time is in the evening and whose door closing time is in the next morning into the overnight flight set T_p-over, and classify data whose difference does not exceed 400, or which does not satisfy door opening in the evening and door closing in the next morning, into the non-overnight flight set T_p-nonover;

(13) taking the collected actual landing time t_al, planned arrival time t_ea, actual takeoff time t_ad and planned departure time t_ed of each flight as the reference execution times of the flight, clean the data of each sample flight time set; abnormal and missing times close to the actual landing time and the actual takeoff time are processed first, and the time data to be processed are divided into two types: the first type of time is associated with the actual landing time t_al or the actual takeoff time t_ad of the flight, and the second type of time is associated with two or more times before and after it in the flight; for the two types of abnormal or missing values, cleaning is carried out within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover respectively, using the logical relationships with the complete time data on the other flow nodes;

3. The flight delay prediction method based on the station-passing flight guarantee process according to claim 2, wherein the step (13) specifically comprises:

(131) for the first type of outliers, if the actual taxi-starting time t_i-at of the ith flight is missing or abnormal, the recalculated actual taxi-starting time of the ith flight is given by the following formula;

in the formula, t_i-at(cal) represents the recalculated actual taxi-starting time of the ith flight, p-name represents the number of flights in the data set, t_i-ad represents the actual takeoff time of the ith flight, and p-noover and p-over represent the numbers of non-overnight flights and overnight flights, respectively;

in the formula, t_i-cd(cal) represents the recalculated actual door closing time of the ith flight, p-name represents the number of flights in the data set, t_i-boff represents the actual gear-off time of the ith flight, t_i-od represents the actual door opening time of the ith flight, and p-noover and p-over represent the numbers of non-overnight flights and overnight flights, respectively.

4. The flight delay prediction method based on the station-passing flight guarantee process according to claim 1, wherein the specific process of the step (2) is as follows:

in the formula, d_i represents the time difference calculated for the ith flight, t_i-ad represents the actual takeoff time of the ith flight, and t_i-ed represents the planned departure time of the ith flight;

(22) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, divide the flights into two types, the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, respectively; the formula is as follows;

(23) within the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively compute the time difference sets Diff_delay and Diff_non-delay on each subsequent-preceding flow node for the flights in the delayed flight time set T_delay and the non-delayed flight time set T_non-delay, wherein a time difference on subsequent-preceding flow nodes is the difference between the later time and the earlier time on two adjacent flow nodes, and comprises: actual landing time - planned arrival time; actual gear shifting time - actual landing time; door opening time - gear shifting time; boarding gate closing time - boarding gate opening time; door closing time - cabin door opening time; actual gear-off time - door closing time; actual gear-off time - boarding gate closing time; actual taxi-starting time - actual gear-off time; actual takeoff time - actual taxi-starting time; and actual takeoff time - planned departure time;

(24) for the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively compute the mean, upper quartile, lower quartile, maximum and minimum of the time difference set Diff_non-delay on each subsequent-preceding flow node of the non-delayed flight time set T_non-delay; draw the distribution of each time difference in Diff_non-delay using box plots; and select the data segment between the lower quartile QL and the upper quartile QU of each time difference on each subsequent-preceding flow node as the standard time period set D_std.

5. The flight delay prediction method based on the station-passing flight guarantee process according to claim 1, wherein the specific process of the step (3) is as follows:

(31) create a graph network structure G = (V, E) for the flight flow node data used, where G represents the created graph network structure; v_a ∈ V, where V = {v₁, v₂, v₃, ..., v_n} is a non-empty set of a finite number of flow nodes and v₁ represents the 1st flow node; (v_a, v_b) ∈ E, where E = {e₁, e₂, e₃, ..., e_{p-delay*(m-1)}} is a set of a finite number of edges, e₁ represents the 1st edge, and p-delay represents the number of delayed flights;

6. The flight delay prediction method based on the station-passing flight guarantee process according to claim 1, wherein the specific process of the step (4) is as follows:

(43) compare the standard time period set D_std of the non-delayed flights obtained in step (24) with the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flight time set T_delay in the overnight flight set T_p-over and the non-overnight flight set T_p-nonover, respectively, according to formula (8), to obtain the set T_std of distances between the time difference set Diff_delay of the delayed flight time set T_delay and the standard time period set D_std; take the distance set T_std as the feature input set X of the flow nodes, arrange X in flight order as the node attribute, and store the arranged data into data set 3 in a two-dimensional data format;

in the formula, T_std represents the set of distances between the time difference set Diff_delay on the subsequent-preceding flow nodes of the delayed flight time set T_delay and the standard time period set D_std, T_is represents the distance of the s-th time difference of the ith flight from the standard time period, QL_s represents the lower quartile of the standard time period of the s-th time difference, QU_s represents the upper quartile of the standard time period of the s-th time difference, Diff_delay-is represents the value of the s-th time difference of the ith delayed flight, and p-delay represents the number of delayed flights;

(44) take the set D_diff-delay of time differences between the actual takeoff time and the planned departure time of all delayed flights in the sample as the graph attribute, arrange it in flight order, and store D_diff-delay as the label into data set 4 in a two-dimensional data format;

(45) package data set 1, data set 2, data set 3 and data set 4 into a graph data set, so that the graph network structure taken out each time is a subset of the original graph network structure.

7. The flight delay prediction method based on the station-passing flight guarantee process according to claim 6, wherein the specific process of the step (5) is as follows:

in the formula, H^(l) represents the features of the l-th layer, A represents the graph network structure description of the flow nodes, here represented by the adjacency matrix A_ab, Z represents the output, and X^(0) represents the feature input set to be fed into the model;

X^(l+1)＝f(X^(l),A) (10)

in the formula, X^(l+1) represents the feature output set of the flow nodes of the l-th layer, D represents the degree matrix of the graph network structure input to the l-th layer, I_n represents the self-loops of the graph network structure, W^(l) represents the convolution kernel of the l-th layer, i.e. the learnable weight, and σ represents the nonlinear transformation;

(54) constructing a GAT layer, and distributing different weights to different edges through attention coefficients;

8. The flight delay prediction method based on the station-passing flight guarantee process according to claim 7, wherein the step (54) specifically comprises:

(541) when the GAT layer propagates, according to the characteristic that each flow node is connected to different neighbor flow nodes, compute the correlation coefficient e_ab from flow node v_a to flow node v_b, and then compute the attention coefficient of each edge; the formula is as follows;

a neighbor process node set representing a process node a;

in the formula, x'_a represents the new feature of each flow node after fusing the neighborhood information, W represents a learnable weight, and x_b represents the feature of flow node v_b;

9. The flight delay prediction method based on the station-passing flight guarantee process according to claim 7, wherein the step (55) specifically comprises:

wherein k represents the total number of aggregation iterations, W^(k) represents the weight to be learned at the k-th aggregation, σ represents the nonlinear transformation,

represents the features of the a-th flow node after the (k-1)-th aggregation, AGGREGATE_k represents the k-th aggregation function, γ(v_a) represents the set of neighbor flow nodes of the a-th flow node,

represents the features of all neighbor flow nodes of the a-th flow node, and CONCAT represents the function that concatenates the features of the a-th flow node with the features of its neighbor flow nodes;

(552) perform L2 normalization on the aggregated features of each flow node; the formula is as follows;

wherein V represents a set of flow nodes in a graph network structure,

10. The flight delay prediction method based on the station-passing flight guarantee process according to claim 7, wherein the specific process of the step (6) is as follows:

(62) partition the data into the training set X_train, the validation set X_val and the test set X_test;

(63) normalize the data, feed the normalized data into each model, average the model results of each run using the K-fold cross-validation method, and train and tune the parameters to obtain the optimal models of the three machine learning models;

(64) input the training set data into each of the four graph convolutional neural network models (double-layer GCN, double-layer GAT, double-layer GraphSAGE, and single-layer GCN combined with single-layer GraphSAGE) for training, use the mean absolute error as the back-propagated error to update the weights, and perform model training and parameter tuning multiple times to obtain the optimal model of each of the four graph convolutional neural network models.