CN110852497A

CN110852497A - Scene variable slide-out time prediction system based on big data deep learning

Info

Publication number: CN110852497A
Application number: CN201911044358.3A
Authority: CN
Inventors: 周龙
Original assignee: Nanjing Smart Aviation Research Institute Co Ltd
Current assignee: Nanjing Smart Aviation Research Institute Co Ltd
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2020-02-28
Also published as: WO2021082394A1

Abstract

The invention relates to a scene variable slide-out time prediction system based on big data deep learning, which comprises: a data set establishing module adapted to acquire historical operating data and perform data cleaning to obtain a data set; an index definition and quantification module adapted to define and quantify traffic condition indexes of the scene traffic characteristics; a feature set extraction module adapted to analyze and extract, based on the data set and the traffic condition indexes, a feature set influencing the scene slide-out time; a model building module adapted to build a scene slide-out time prediction model from the feature set through an ensemble machine learning method; and a prediction module adapted to predict the airport scene slide-out time through the scene slide-out time prediction model. The system processes the airport's original recorded data, models the airport scene traffic conditions, analyzes and extracts the factors influencing the slide-out time, and trains a GBRT ensemble learning model to obtain a slide-out time prediction model, providing a data basis for the management and optimization of airport operation.

Description

Scene variable slide-out time prediction system based on big data deep learning

Technical Field

The invention relates to the field of airport traffic control, in particular to a scene variable slide-out time prediction system based on big data deep learning.

Background

In the prior art, aircraft slide-out time prediction is mostly modeled in two ways: simulation and analysis. A simulation model takes the existing airport topological structure model and conflict detection and resolution as factors, and obtains the slide-out time by simulating the operation of all arriving and departing aircraft on the ground. A simulation model is strongly tailored to one airport and does not generalize well to different airports. Conventional research on analytical models has focused mainly on linear regression models, with some attempts to use machine learning techniques. For an analytical model, determining the main factors influencing the slide-out time is the emphasis of the study. Analytical models usually suffer from defects such as an incomplete set of influence factors, so their practical reference value is weak and they cannot meet the requirements of actual application.

How to solve the above problems is an issue that needs to be addressed.

Disclosure of Invention

The invention aims to provide a scene variable slide-out time prediction system based on big data deep learning, so as to achieve the purpose of improving the comprehensiveness of influence factors in an analysis model.

In order to solve the technical problem, the invention provides a scene variable slide-out time prediction system based on big data deep learning, which comprises:

the data set establishing module is suitable for acquiring historical operating data and performing data cleaning to obtain a data set;

the index definition and quantification module is suitable for defining and quantifying the traffic condition index of the scene traffic characteristic;

the characteristic set extraction module is suitable for analyzing and extracting a characteristic set influencing the slide-out time of the scene based on the data set and the traffic condition indexes;

the model establishing module is suitable for establishing a scene slide-out time prediction model through an integrated machine learning method according to the feature set;

and the prediction module is suitable for completing prediction of the airport scene slide-out time through the scene slide-out time prediction model.

Further, the data set establishing module comprises:

the original data set acquisition unit is suitable for acquiring historical operating data to construct an original data set;

the data cleaning unit is suitable for cleaning the original data set;

the data set acquisition unit is suitable for integrating the original data set to acquire a data set;

and the data set dividing unit is suitable for dividing the data set into a training set and a test set.

Further, the index defining and quantizing module includes:

the network topological structure acquisition unit is suitable for modeling the traffic situation of the airport scene by adopting a macroscopic space-time network topological model to acquire a macroscopic space-time network topological structure;

and the quantization unit is suitable for defining four types of indexes for representing scene traffic based on a macroscopic space-time network topological structure and quantizing the indexes.

Further, the feature set extraction module comprises:

the original feature set extracting unit is suitable for extracting features influencing scene slide-out time from the data set and the traffic condition indexes and forming an original feature set;

the feature analysis unit is suitable for performing feature analysis on the features in the original feature set;

And the feature set construction unit is suitable for constructing a feature set according to the feature analysis result.

Further, the feature analysis unit is configured to: performing feature analysis on the features of the original feature set by adopting one or more of correlation coefficient, standardized mutual information and factor analysis;

the correlation coefficient is a statistic reflecting the degree of linear correlation between two variables; its value lies in $[-1, 1]$, a larger absolute value indicates a stronger linear correlation, a positive value indicates positive correlation, and a negative value indicates negative correlation; with $X$ and $Y$ denoting any two variables, the correlation coefficient $\rho_{X,Y}$ is defined as:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

where $\mathrm{Cov}(X, Y)$ is the covariance of $X$ and $Y$, $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$;

the standardized mutual information is a common correlation metric with a value range of $[0, 1]$; the larger the value, the stronger the correlation between the variables; the standardized mutual information $U_{X,Y}$ is defined as:

$$U_{X,Y} = \frac{2\, I_{X,Y}}{H_X + H_Y}, \qquad I_{X,Y} = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

where $I_{X,Y}$ is the mutual information of $X$ and $Y$, $H_X$ and $H_Y$ are the respective entropies of $X$ and $Y$, $p(x, y)$ is the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ are the respective probability distributions of $X$ and $Y$;

factor analysis assumes that the extracted feature $x$ is completely controlled by latent influence factors $z$, expressed as $x = Az + \epsilon$, where $A$ is a coefficient matrix and $\epsilon$ is an error term; the influence factors are independent of each other, and the influence factors and the error are independent of each other, from which it is finally derived that $\Sigma_x = AA^T + \Sigma_\epsilon$, where $\Sigma$ denotes a covariance matrix, so that $A$ and $z$ can be found.
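The step from $x = Az + \epsilon$ to the covariance identity can be written out as follows. This is a sketch under assumptions the text does not state explicitly: $x$, $z$ and $\epsilon$ are taken to be zero-mean, and the mutually independent factors are taken to have unit variance, so that $E[zz^T] = I$.

```latex
\begin{aligned}
\Sigma_x &= E\big[(Az+\epsilon)(Az+\epsilon)^T\big] \\
         &= A\,E[zz^T]\,A^T + A\,E[z\epsilon^T] + E[\epsilon z^T]\,A^T + E[\epsilon\epsilon^T] \\
         &= A\,I\,A^T + 0 + 0 + \Sigma_\epsilon \\
         &= AA^T + \Sigma_\epsilon
\end{aligned}
```

The two cross terms vanish because the influence factors and the error are independent and zero-mean.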

Further, the model building module comprises:

the initial model acquisition unit is suitable for acquiring an initial model by taking the feature set as the input of the integrated learning model GBRT;

the training unit is suitable for training the initial model and adjusting the values of the hyper-parameters so as to complete the establishment of the scene slide-out time prediction model.

Further, the training unit is to:

selecting the maximum depth as a control mode for controlling the decision tree;

selecting least squares as a loss function;

under the optimal value of the product of the learning rate and the number of estimators, selecting the largest learning rate, and the correspondingly smallest number of estimators, at which the performance remains stable;

setting the minimum sample split to 200 according to the overall distribution of the slide-out time in the training set;

and finishing the training of the initial model so as to establish a scene slide-out time prediction model.

Further, the model building module further comprises:

the model test unit is suitable for verifying the scene slide-out time prediction model by using the test set and evaluating its performance.

Further, the performance evaluation in the model test unit adopts the mean square error (MSE), calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(o_i - p_i)^2$$

where $N$ is the number of test set samples, $o_i$ is the actual slide-out time of the $i$-th sample, and $p_i$ is the slide-out time predicted by the model for the $i$-th sample.

The beneficial effect of the invention is that it provides a scene variable slide-out time prediction system based on big data deep learning. The system comprises: the data set establishing module, suitable for acquiring historical operating data and performing data cleaning to obtain a data set; the index definition and quantification module, suitable for defining and quantifying the traffic condition indexes of the scene traffic characteristics; the feature set extraction module, suitable for analyzing and extracting a feature set influencing the scene slide-out time based on the data set and the traffic condition indexes; the model building module, suitable for building a scene slide-out time prediction model from the feature set through an ensemble machine learning method; and the prediction module, suitable for predicting the airport scene slide-out time through the scene slide-out time prediction model. The system processes the airport's original recorded data, models the airport scene traffic conditions, analyzes and extracts the factors influencing the slide-out time, and trains a GBRT ensemble learning model to obtain a slide-out time prediction model, providing a data basis for the management and optimization of airport operation.

Drawings

The invention is further illustrated with reference to the following figures and examples.

FIG. 1 is a schematic block diagram of a scene variable slide-out time prediction system based on big data deep learning provided by the present invention.

FIG. 2 is a schematic diagram of a taxi process macro spatiotemporal network topology provided by the present invention.

FIG. 3 shows the correlation coefficients between the candidate influencing factors and the slide-out time provided by the present invention.

FIG. 4 shows the standardized mutual information between the candidate influencing factors and the slide-out time provided by the present invention.

FIG. 5 is a graph of the results of factor analysis of candidate influencing factors provided by the present invention.

FIG. 6 is a diagram of the performance variation process during the model training and testing phases provided by the present invention.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views illustrating only the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.

Example 1

As shown in fig. 1, this embodiment 1 provides a scene variable slide-out time prediction system based on big data deep learning, which processes original recorded data of an airport, models traffic conditions of the airport scene, analyzes and extracts influence factors of slide time, trains a GBRT ensemble learning model, and further obtains a slide-out time prediction model, thereby providing a data basis for management and optimization of airport operation. Specifically, the scene variable slide-out time prediction system based on big data deep learning comprises:

the model establishing module is suitable for establishing a scene slide-out time prediction model by an ensemble machine learning method according to the feature set.

In this embodiment, the data set creating module includes:

and the original data set acquisition unit is suitable for acquiring historical operating data to construct an original data set.

Specifically, as much data as possible is extracted from the airport scene operation database to form the original data set of airport flight departure operations. The collected information covers five groups: taxi route information, including the departure runway, departure stand, corridor entrance number and taxi distance; flight attribute information, including the flight number, flight type, aircraft model, airline and engine type; traffic control information, including whether the flight is restricted, controller information, communication information, delay conditions, local weather and airport broadcasts; flight plan information, including the takeoff airport, destination airport, planned takeoff time, planned off-block time and waypoint information; and actual records of the taxi process, including the off-block time, push-out time, start-up request and clearance times, actual takeoff time, taxi speed and waiting time at the runway end.

And the data cleaning unit is suitable for cleaning the original data set.

Specifically, the processing scheme is formulated for practical work, taking into account that the airport acquires the data set in real time. Missing values are handled in two ways: setting default values and direct deletion. The default value "no" is set for both 'restricted or not' and 'restricted content'. After the default values are filled in, attributes with more than half of their values missing are deleted outright; these are 'start-up request', 'start-up clearance', 'off-block time', 'wake', 'taxi speed' and 'departure queue number'. A completeness check is then run on the data set, and data entries that still have missing information are deleted. Abnormal values are handled by first checking the data type of every attribute and whether its values are out of bounds, and then applying a bound-based detection method to selected attributes: a value range is defined for each such attribute based on the actual operation of the airport scene, and any value outside its range is treated as abnormal. Finally, data entries containing abnormal values are deleted from the data set. The attribute value ranges are shown in the following table.

Value range of partial attribute
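The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the attribute names (`restricted`, `taxi_distance_m`, `taxi_speed`), the value range and the sample rows are hypothetical, and every row is assumed to carry the same set of keys with `None` marking a missing value.

```python
# Hypothetical defaults and value ranges; the source's own range table is not reproduced here.
DEFAULTS = {"restricted": "no", "restricted_content": "no"}
RANGES = {"taxi_distance_m": (0, 10000)}


def clean(rows, defaults=DEFAULTS, ranges=RANGES):
    """Apply default filling, sparse-attribute removal, completeness and range checks."""
    if not rows:
        return []
    # 1. Fill in default values for attributes that define one.
    filled = [
        {k: (defaults[k] if v is None and k in defaults else v) for k, v in r.items()}
        for r in rows
    ]
    # 2. Delete attributes with more than half of their values missing.
    n = len(filled)
    keep = [k for k in filled[0] if sum(r[k] is None for r in filled) <= n / 2]
    reduced = [{k: r[k] for k in keep} for r in filled]
    # 3. Completeness check: delete entries that still have missing values.
    complete = [r for r in reduced if all(v is not None for v in r.values())]
    # 4. Bound-based check: delete entries with a value outside its range.
    return [
        r for r in complete
        if all(lo <= r[k] <= hi for k, (lo, hi) in ranges.items() if k in r)
    ]


rows = [
    {"restricted": None, "taxi_distance_m": 2500, "taxi_speed": None},
    {"restricted": "yes", "taxi_distance_m": 3000, "taxi_speed": None},
    {"restricted": None, "taxi_distance_m": -5, "taxi_speed": 12.0},
    {"restricted": None, "taxi_distance_m": None, "taxi_speed": None},
]
cleaned = clean(rows)
```

In this sample, `taxi_speed` is dropped as an attribute because three of four values are missing, the last row is dropped as incomplete, and the row with a negative distance is dropped as abnormal.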

And the data set acquisition unit is suitable for integrating the original data set to acquire the data set.

Specifically, this step covers redundant attribute identification, data type conversion and logic error checking. For redundant attributes, attributes that carry little information are identified by computing the information entropy of each attribute, and attributes whose information is already contained in other attributes are identified by computing the mutual information between attributes; the redundant attributes "takeoff airport" and "execution date" are deleted. For data type conversion, information in non-numerical attributes that serves only as an identifier is converted into integer values that are easier to process and use later; the information in the "restricted content" attribute is difficult to quantify, and the attribute is deleted after comprehensive consideration. For logic error checking, constraint relations among the features are established from their physical meanings and violations are eliminated: the correspondence between aircraft model and number of engines is checked, the chronological order of the time nodes in scene operation is checked, and entries with logic errors are deleted directly.
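Two of the checks in this step lend themselves to a short sketch: flagging low-information attributes by their entropy, and checking the order of time nodes. The code below is an illustration only; the entropy threshold, the attribute names and the sample rows are hypothetical, and the mutual-information comparison between attributes is not covered.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a column of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def low_information_attributes(rows, threshold=0.1):
    """Attributes whose entropy falls below a (hypothetical) threshold."""
    return [k for k in rows[0] if entropy([r[k] for r in rows]) < threshold]


def time_order_ok(row, order=("pushout", "takeoff")):
    """Logic check: the listed time nodes must be strictly increasing."""
    times = [row[k] for k in order]
    return all(a < b for a, b in zip(times, times[1:]))


rows = [
    {"takeoff_airport": "AAA", "runway": "06", "pushout": 10, "takeoff": 25},
    {"takeoff_airport": "AAA", "runway": "24", "pushout": 30, "takeoff": 28},
]
redundant = low_information_attributes(rows)
valid_rows = [r for r in rows if time_order_ok(r)]
```

A column that takes the same value in every entry, such as the takeoff airport of a single-airport data set, has zero entropy and is flagged; the second row is rejected because its takeoff precedes its push-out.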

Specifically, the data set is divided into a training set and a test set. After the final processed data set is obtained and before the machine learning model is trained, 10% of the data is held out as the test set to verify the validity and robustness of the model, and the remaining 90% is used as the training set in the training phase. The training set and the test set therefore come from the same source and follow the same distribution.
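A 90/10 hold-out of this kind might look as follows. The random shuffle and the fixed seed are assumptions made for the sketch; the text only specifies the proportions.

```python
import random


def split_dataset(rows, test_fraction=0.1, seed=0):
    """Shuffle the entries and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integers stand in for flight records here.
train, test = split_dataset(range(1000))
```

Because both parts are drawn from the same shuffled pool, they share the same source and distribution, which is the property the paragraph above relies on.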

In this embodiment, the index defining and quantizing module includes:

Specifically, a macroscopic space-time network topological model is adopted to model the traffic situation of the airport scene. FIG. 2 visualizes the general network topology of the departure and arrival taxiing processes in an arbitrary space-time domain. In actual operation of an airport scene, the slide-in and slide-out processes are coupled and interdependent, so the model also considers the influence of arrivals on the departure process. The space-time network topological model is a general framework for describing the macroscopic resource flow of an airport system. In FIG. 2, $d_1,\dots,d_4$ denote departure flights in each of the four possible relationships to the reference departure flight $d_0$: "before launch, before launch", "before launch, after launch", "after launch, before launch" and "after launch, after launch". Similarly, $a_1,\dots,a_4$ denote arrival flights in each of the four possible relationships to the reference arrival flight $a_0$: "before landing, before landing", "before landing, after landing", "after landing, before landing" and "after landing, after landing". $t_{on}$ and $t_{in}$ denote the landing time and in-position time of the reference arrival flight $a_0$. $t_{out}$ and $t_{off}$ denote the push-out time and takeoff time of the reference departure flight $d_0$. $\delta$ denotes the time threshold for arrivals and departures.

Specifically, based on the macroscopic space-time network topological structure, eight indexes in four categories are defined to represent scene traffic. The four categories are Scene Instantaneous Flow Indexes (SIFIs), Scene Cumulative Flow Indexes (SCFIs), Airplane Queuing Length Indexes (AQLIs) and Slot Resource Demand Indexes (SRDIs). Two statistics are computed in each category: the number of departing aircraft (prefixed with D-) and the number of arriving aircraft (prefixed with A-). The following table gives these statistics for the situation in FIG. 2, with $d_0$ as the reference departure flight.

Departure flight d₀Statistical result of scene traffic situation indexes

Taking FIG. 2 as an example, the definition and calculation of the indexes in the table above are described in detail below. For any departure flight $d_0$, the SIFIs comprise D-SIFI and A-SIFI, which are the numbers of departing and arriving flights that are taxiing at the moment $d_0$ is pushed out of the gate. The SCFIs comprise D-SCFI and A-SCFI, which are the numbers of departing and arriving aircraft whose taxi periods overlap the taxi period of $d_0$. The AQLIs comprise D-AQLI and A-AQLI, which are the numbers of aircraft that take off and land on the runway during the entire taxi process of $d_0$. The SRDIs comprise D-SRDI and A-SRDI, which are the numbers of aircraft that take off and land during the departure slot $[t_0-\delta, t_0+\delta]$ of $d_0$. In general, $\delta$ may be set between 10 and 30 minutes.
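The eight indexes can be computed by simple counting over flight records, as the following sketch shows. Several points are assumptions rather than statements from the text: times are plain numbers in minutes, the reference flight itself is excluded from the departure counts, interval boundaries are treated as inclusive where the text does not say, and the slot centre `t0` is passed in explicitly because its definition is not given. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Departure:
    t_out: float  # push-out time
    t_off: float  # takeoff time


@dataclass(frozen=True)
class Arrival:
    t_on: float  # landing time
    t_in: float  # in-position time


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def traffic_indices(d0, departures, arrivals, t0, delta):
    """Count the eight scene traffic indexes for the reference departure d0."""
    others = [d for d in departures if d is not d0]
    return {
        # Flights taxiing at the moment d0 is pushed out.
        "D-SIFI": sum(d.t_out <= d0.t_out < d.t_off for d in others),
        "A-SIFI": sum(a.t_on <= d0.t_out < a.t_in for a in arrivals),
        # Flights whose taxi period overlaps the taxi period of d0.
        "D-SCFI": sum(overlaps(d.t_out, d.t_off, d0.t_out, d0.t_off) for d in others),
        "A-SCFI": sum(overlaps(a.t_on, a.t_in, d0.t_out, d0.t_off) for a in arrivals),
        # Takeoffs and landings during the taxi process of d0.
        "D-AQLI": sum(d0.t_out <= d.t_off <= d0.t_off for d in others),
        "A-AQLI": sum(d0.t_out <= a.t_on <= d0.t_off for a in arrivals),
        # Takeoffs and landings inside the slot [t0 - delta, t0 + delta].
        "D-SRDI": sum(t0 - delta <= d.t_off <= t0 + delta for d in others),
        "A-SRDI": sum(t0 - delta <= a.t_on <= t0 + delta for a in arrivals),
    }


d0 = Departure(100, 115)
departures = [d0, Departure(90, 105), Departure(110, 130), Departure(50, 60)]
arrivals = [Arrival(95, 102), Arrival(112, 120), Arrival(200, 210)]
indices = traffic_indices(d0, departures, arrivals, t0=115, delta=15)
```

Here the slot is centred on the takeoff time of `d0` purely for illustration.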

In this embodiment, the feature set extracting module includes:

and the original feature set extracting unit is suitable for extracting features influencing the scene slide-out time from the data set and the traffic condition indexes and forming an original feature set.

Specifically, the factors related to the scene slide-out time that were obtained by the data set establishing module and the index definition and quantification module are collected to form an original feature set. The original feature set is then processed: new features are extracted from the original features and replace some of them.

The factors related to the scene slide-out time obtained by the data set establishing module are: flight number, flight attribute, destination airport, planned takeoff time, aircraft model, airline, push-out time, actual takeoff time, departure runway, departure stand, stand type, engine type, corridor entrance, restricted or not, and gate. The factors related to the scene slide-out time obtained in S120 are: D-SIFI, D-SCFI, D-AQLI, D-SRDI and Corridor_NO. The difference between the push-out time and the actual takeoff time is used as the scene slide-out time and replaces those two original features. The new features month, day, week, hour and minute are extracted from the planned takeoff time and replace it. The stand and gate features are further subdivided and analyzed, and the correspondence between the runway and the stand/gate is extracted as a new feature. The resulting original feature set, i.e. the candidate influencing factors, is shown in the following table:

candidate influencing factors
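The two derived-feature steps described above, the slide-out time as a time difference and the calendar features taken from the planned takeoff time, can be illustrated with the standard library. The field names, the timestamp format and the reading of "week" as day of the week are assumptions made for this sketch.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp format


def derive_features(record):
    """Replace raw timestamps with the slide-out time and calendar features."""
    pushout = datetime.strptime(record["pushout_time"], TIME_FORMAT)
    takeoff = datetime.strptime(record["actual_takeoff_time"], TIME_FORMAT)
    planned = datetime.strptime(record["planned_takeoff_time"], TIME_FORMAT)
    return {
        # Target variable: push-out to actual takeoff, in minutes.
        "slide_out_minutes": (takeoff - pushout).total_seconds() / 60,
        "month": planned.month,
        "day": planned.day,
        "week": planned.isoweekday(),  # 1 = Monday ... 7 = Sunday (assumed encoding)
        "hour": planned.hour,
        "minute": planned.minute,
    }


features = derive_features({
    "pushout_time": "2019-10-30 08:05",
    "actual_takeoff_time": "2019-10-30 08:23",
    "planned_takeoff_time": "2019-10-30 08:10",
})
```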

Specifically, based on the analysis results of the feature analysis unit, important features are selected from the original feature set formed by the original feature set extraction unit to form the feature set used by the ensemble machine learning model. Features with little correlation to the scene slide-out time are screened out, namely "engine type", "stand type", "month", "week", "day" and "minute". The resulting feature set, i.e. the influencing factors, is shown in the following table:

finally selected influencing factors

In this embodiment, the feature analysis unit includes:

one or more of the correlation coefficient, standardized mutual information and factor analysis are adopted to perform feature analysis on the features of the original feature set. FIGS. 3, 4 and 5 show, respectively, the Pearson correlation coefficients between the candidate influencing factors and the slide-out time, the standardized mutual information between the candidate influencing factors and the slide-out time, and the factor analysis results of the candidate influencing factors.
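As an illustration of the first two measures, the sketch below computes a Pearson correlation coefficient and a normalized mutual information for two columns. The normalization $2I/(H_X + H_Y)$ is one common choice and is assumed here; the mutual information is computed on discrete values, so continuous features would need to be binned first, and both functions divide by zero if a column is constant. The sample data are made up.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)


def entropy(values):
    """Shannon entropy (natural log) of a column of discrete values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_information(xs, ys):
    """2 * I(X;Y) / (H(X) + H(Y)), a value in [0, 1] (assumed normalization)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum(
        c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
    return 2 * mi / (entropy(xs) + entropy(ys))


r_linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])
u_dependent = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
u_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A perfectly linear pair gives a coefficient of about 1, a pair of columns that determine each other gives a normalized mutual information of about 1, and an independent pair gives 0.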

In this embodiment, the model building module includes:

and the training unit is suitable for training the initial model and adjusting the value of the hyper-parameter so as to complete the establishment of the scene slide-out time prediction model.

In this embodiment, the training unit is configured to: select 'maximum depth' as the way of controlling the decision tree size; select 'least squares' as the loss function; under the optimal value of the product of the learning rate and the number of estimators, select the largest learning rate, and the correspondingly smallest number of estimators, at which the performance remains stable; set the minimum sample split to 200 according to the overall distribution of the slide-out time in the training set; and complete the training of the initial model so as to establish the scene slide-out time prediction model.

Specifically, a Gradient Boosted Regression Trees (GBRT) model, a typical representative of ensemble learning, is adopted to predict the scene slide-out time. The feature set obtained in step S133 is taken as the input of the model, and the GBRT model is trained quickly by running the algorithm in the scikit-learn library. The hyper-parameters to be set are: decision tree size control, loss function type, number of estimators, learning rate and minimum sample split. There are two options for controlling the size of the decision trees: "maximum depth (max_depth)" and "maximum number of leaf nodes (max_leaf_nodes)". There are four alternative loss functions for the regression task: "least squares (ls)", "least absolute deviation (lad)", "Huber loss (huber)" and "quantile loss (quantile)". Because the learning rate and the number of estimators interact strongly, their product roughly reflects the amount of iterative training. Thus, when setting the parameters, different product values are tried empirically and the product value that achieves the best performance on the training set is selected. The minimum sample split controls the lower limit of the number of samples in a leaf node and improves the robustness of the model. In general, the hyper-parameter values need to be adjusted sensibly according to the data of the application scenario.
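A configuration along these lines might look as follows with scikit-learn's `GradientBoostingRegressor`. The data are synthetic, the product value of 20 and the candidate learning rates are hypothetical, and mapping "minimum sample split" to the `min_samples_split` parameter is an assumption. The loss is left at its default, which is least squares; recent scikit-learn releases name it `squared_error` rather than `ls`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the selected feature set and the slide-out time.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 5))
y = 10 + 8 * X[:, 0] + 4 * X[:, 1] * X[:, 2] + rng.normal(0, 0.5, size=2000)
X_train, y_train, X_test, y_test = X[:1800], y[:1800], X[1800:], y[1800:]

product = 20.0  # hypothetical fixed value of learning_rate * n_estimators
models = {}
for learning_rate in (0.05, 0.1, 0.2):
    model = GradientBoostingRegressor(
        max_depth=3,                  # tree size controlled by maximum depth
        learning_rate=learning_rate,
        n_estimators=int(round(product / learning_rate)),
        min_samples_split=200,        # value given in the text
        random_state=0,
    )
    models[learning_rate] = model.fit(X_train, y_train)

# Compare the candidates on held-out data; the largest learning rate whose
# error stays stable needs the fewest estimators.
test_mse = {
    lr: float(np.mean((m.predict(X_test) - y_test) ** 2)) for lr, m in models.items()
}
```

Holding the product fixed while raising the learning rate trades many small steps for fewer large ones, which is why the text picks the largest rate that keeps performance stable.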

Specifically, the GBRT model $F(x)$ is an additive model of the form:

$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$

where $h_m(x)$ is a basis function, commonly called a weak learner in the context of boosting, $\gamma_m$ is the weight of the corresponding weak learner, and $M$ is the total number of weak learners. GBRT uses decision trees of fixed size as weak learners. Like other boosting algorithms, GBRT builds the additive model greedily:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

where $F_m(x)$ is the GBRT model obtained in the $m$-th iteration, and $h_m(x)$ is obtained from

$$h_m = \arg\min_{h} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + h(x_i)\big)$$

where $n$ is the total number of training samples, $L$ is the selected loss function, $y_i$ is the label of the $i$-th sample, $F_{m-1}(x_i)$ is the prediction on the $i$-th sample of the GBRT model obtained in iteration $m-1$, and $h(x_i)$ is the prediction on the $i$-th sample of the weak learner being sought. $\gamma_m$ is obtained from

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$

with $n$, $L$, $y_i$ and $F_{m-1}(x_i)$ as defined above.

The initial model $F_0$ is problem-specific; for least-squares regression, the mean of the target values is usually chosen. That is, the initial model is a constant rather than a trained weak learner:

$$F_0(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$$
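The greedy construction can be made concrete with a toy implementation for a single feature and squared loss. This is a teaching sketch, not the scikit-learn algorithm: the weak learner is a depth-1 regression stump, the loss is taken as $L = \tfrac{1}{2}(y - F)^2$ so that the negative gradient equals the residual, the step is shrunk by a learning rate as in the hyper-parameter list above, and the input is assumed to contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Build F_m = F_{m-1} + learning_rate * gamma_m * h_m, starting from the target mean."""
    f0 = sum(ys) / len(ys)  # F_0: mean of the target values
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of the loss
        t, lv, rv = fit_stump(xs, residuals)            # weak learner h_m
        h = [lv if x <= t else rv for x in xs]
        denom = sum(v * v for v in h)
        # Closed-form line search for gamma_m under squared loss.
        gamma = sum(r * v for r, v in zip(residuals, h)) / denom if denom else 0.0
        step = learning_rate * gamma
        preds = [p + step * v for p, v in zip(preds, h)]
        stumps.append((t, step * lv, step * rv))
    return f0, stumps


def predict(model, x):
    """Evaluate the additive model at a single feature value."""
    f0, stumps = model
    return f0 + sum(lv if x <= t else rv for t, lv, rv in stumps)


xs = list(range(10))
ys = [5.0] * 5 + [15.0] * 5
model = boost(xs, ys)
```

On this step-shaped toy data every iteration removes a tenth of the remaining residual, so after 50 iterations the predictions are close to 5 and 15 on the two sides of the step.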

In this embodiment, the model building module further includes:

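Specifically, the scene slide-out time prediction model is verified using the test set, and the mean square error (MSE) is adopted for performance evaluation, calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(o_i - p_i)^2$$

where $N$ is the number of test set samples, $o_i$ is the actual slide-out time of the $i$-th sample, and $p_i$ is the slide-out time predicted by the model for the $i$-th sample.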

In this example, the MSE was used to monitor how the performance of the model changed during training and testing, with the results shown in FIG. 6. In the end, the MSE reached 2.5 on the training set and 5.5 on the test set. Although there is a gap between the MSE on the training set and on the test set, the result reflects the generalization capability of the model to some extent.

In addition, the following table compares the prediction accuracy on the test set within different error ranges. Over the whole test set, 85.7% of the samples have a slide-out time prediction error within 3 minutes, over 93% have an error within 4 minutes, and about 96.5% have an error within 5 minutes. According to the verification results on the test set, the designed data mining model and algorithm can satisfactorily meet the precision requirement of the actual scene dynamic slide-out time prediction task.

Test set accuracy within different error ranges

Error range (minutes)	[-3,3]	[-4,4]	[-5,5]
Accuracy	85.7%	93.1%	96.5%
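Both evaluation measures used here, the MSE and the share of predictions falling within a given error range, take only a few lines to compute. The sample values below are invented for illustration and are not the patent's data.

```python
def mean_squared_error(actual, predicted):
    """MSE = (1/N) * sum of (o_i - p_i)^2."""
    return sum((o - p) ** 2 for o, p in zip(actual, predicted)) / len(actual)


def accuracy_within(actual, predicted, tolerance):
    """Fraction of samples whose absolute prediction error is at most `tolerance`."""
    hits = sum(abs(o - p) <= tolerance for o, p in zip(actual, predicted))
    return hits / len(actual)


# Illustrative slide-out times in minutes.
actual = [12.0, 18.0, 25.0, 9.0]
predicted = [13.0, 14.5, 24.0, 9.0]

mse = mean_squared_error(actual, predicted)
within_3 = accuracy_within(actual, predicted, 3)
within_4 = accuracy_within(actual, predicted, 4)
```

Widening the tolerance can only keep or raise the accuracy, which matches the monotone pattern in the table above.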

In summary, the invention provides a scene variable slide-out time prediction system based on big data deep learning. The system comprises: the data set establishing module, suitable for acquiring historical operating data and performing data cleaning to obtain a data set; the index definition and quantification module, suitable for defining and quantifying the traffic condition indexes of the scene traffic characteristics; the feature set extraction module, suitable for analyzing and extracting a feature set influencing the scene slide-out time based on the data set and the traffic condition indexes; the model building module, suitable for building a scene slide-out time prediction model from the feature set through an ensemble machine learning method; and the prediction module, suitable for predicting the airport scene slide-out time through the scene slide-out time prediction model. The system processes the airport's original recorded data, models the airport scene traffic conditions, analyzes and extracts the factors influencing the slide-out time, and trains a GBRT ensemble learning model to obtain a slide-out time prediction model, providing a data basis for the management and optimization of airport operation.

In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims

1. A scene variable slide-out time prediction system based on big data deep learning, comprising:

2. The big-data deep learning-based scene variable slide-out time prediction system of claim 1,

the data set building module comprises:

the data cleaning unit is suitable for cleaning the original data set;

3. The big-data deep learning-based scene variable slide-out time prediction system of claim 2,

the index definition and quantization module comprises:

4. The big-data deep learning-based scene variable slide-out time prediction system of claim 3,

the feature set extraction module comprises:

5. The big-data deep learning-based scene variable slide-out time prediction system of claim 4,

the feature analysis unit is configured to:

performing feature analysis on the features of the original feature set by adopting one or more of correlation coefficient, standardized mutual information and factor analysis;

wherein $I_{X,Y}$ is the mutual information of $X$ and $Y$, $H_X$ and $H_Y$ are the respective entropies of $X$ and $Y$, $p(x, y)$ is the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ are the respective probability distributions of $X$ and $Y$;

factor analysis assumes that the extracted feature $x$ is completely controlled by latent influence factors $z$, expressed as $x = Az + \epsilon$, where $A$ is a coefficient matrix and $\epsilon$ is an error term; the influence factors are independent of each other, and the influence factors and the error are independent of each other, from which it is finally derived that $\Sigma_x = AA^T + \Sigma_\epsilon$, where $\Sigma$ denotes a covariance matrix, so that $A$ and $z$ can be found.

6. The big-data deep learning-based scene variable slide-out time prediction system of claim 5,

the model building module comprises:

7. The big-data deep learning-based scene variable slide-out time prediction system of claim 6,

the training unit is configured to:

selecting least squares as a loss function;

8. The big-data deep learning-based scene variable slide-out time prediction system of claim 7,

the model building module further comprises:

9. The big-data deep learning-based scene variable slide-out time prediction system of claim 8,

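the performance evaluation in the model test unit adopts the mean square error (MSE), calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(o_i - p_i)^2$$

where $N$ is the number of test set samples, $o_i$ is the actual slide-out time of the $i$-th sample, and $p_i$ is the slide-out time predicted by the model for the $i$-th sample.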