CN111626508A - Rail transit vehicle-mounted data prediction method based on xgboost model

Info

Publication number: CN111626508A
Application number: CN202010460661.8A
Authority: CN
Inventors: 王晓玲; 李欣
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-09-04
Anticipated expiration: 2040-05-27
Also published as: CN111626508B

Abstract

The invention discloses a rail transit vehicle-mounted data prediction method based on an xgboost model. Rail transit vehicle-mounted data are first collected. Representative vehicle-mounted data features are then selected from all vehicle-mounted data features of the rail transit by means of a CART decision tree, and the data for these representative features are extracted from the original vehicle-mounted data to form the feature-extracted vehicle-mounted data. An xgboost model is constructed from the feature-extracted vehicle-mounted data and the corresponding labels. During actual rail transit operation, vehicle-mounted data are collected and input into the xgboost model to obtain a prediction of the parking distance. Because the representative features are extracted with a CART decision tree and the xgboost model is built on the feature-extracted data, the method can effectively improve the accuracy of rail transit parking distance prediction.

Description

Rail transit vehicle-mounted data prediction method based on xgboost model

Technical Field

The invention belongs to the technical field of rail transit, and particularly relates to a rail transit vehicle-mounted data prediction method based on an xgboost model.

Background

Rail transit has increasingly become an indispensable part of urban life. Hundreds of sensors are distributed on trains and along the lines to monitor various data during train operation, and determining the causes of train faults and errors purely by manual analysis of these data involves a heavy workload. Analysis of the sensor data also helps to adjust train operating parameters in time, providing passengers with a better travel experience. Data analysis is increasingly valued by companies, and analyzing historical data to make predictions about the future is the most important task of data analysis.

Rail transit vehicle-mounted data are representative of the data formats found in most application fields: the data volume is huge, the features are numerous, and the data types are rich. Since rail transit is an indispensable means of urban travel, data analysis is an indispensable part of rail transit operation. However, as technology evolves, traditional manual analysis can no longer cope with the growing data volume and new analysis requirements. With the rapid development of artificial intelligence and machine learning, data-driven services are increasing day by day, and it has become common practice in the industry to use machine learning algorithms for data cleaning, feature selection and feature combination, and to construct models to analyze massive data.

For feature extraction from rail transit data, common methods include principal component analysis (PCA), the correlation coefficient method, and the like. PCA can compress a large number of features while retaining the more important ones, but it is suitable only when strong correlation exists among the variables; for weakly correlated data the feature extraction effect is not ideal. In addition, a small amount of information may be lost during feature extraction, and the meaning of the data may change, so the result is less interpretable than the original data. The correlation coefficient method is sensitive to how the data are organized and requires the data to be linearly correlated; if the relationship is non-linear, such as a square relationship, the correlation coefficient may be very small.
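The limitation of the correlation coefficient method on a square relationship can be checked with a small numerical sketch. The data below are made up for illustration and the helper name `pearson` is hypothetical; the point is only that a label fully determined by a feature through y = x² can still have a Pearson correlation of zero with it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only, symmetric around zero
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys_square = [x ** 2 for x in xs]         # y is fully determined by x, but non-linearly
ys_linear = [3.0 * x + 1.0 for x in xs]  # y is a linear function of x

r_square = pearson(xs, ys_square)   # close to 0: the dependence is invisible to Pearson
r_linear = pearson(xs, ys_linear)   # close to 1: linear dependence is detected
print(r_square, r_linear)
```

A feature filter based on this coefficient would discard the squared feature even though it predicts the label perfectly, which is the weakness described above.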

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a rail transit vehicle-mounted data prediction method based on an xgboost model.

In order to achieve the purpose, the rail transit vehicle-mounted data prediction method based on the xgboost model comprises the following steps:

S1: M vehicle-mounted data features of the rail transit are set according to actual needs. During actual rail transit operation, the values of the M vehicle-mounted data features are collected at each of N stops, and the value of the m-th vehicle-mounted data feature at the n-th stop is denoted $f_{nm}$, where $n = 1, 2, \ldots, N$ and $m = 1, 2, \ldots, M$. The M feature values obtained at each stop form one piece of vehicle-mounted data $F_n = \{f_{n1}, f_{n2}, \ldots, f_{nM}\}$. At the same time, the distance $d_n$ between the train door and the platform screen door when the stop is completed is recorded and used as the label corresponding to that piece of vehicle-mounted data;

S2: A CART decision tree is constructed from the N pieces of rail transit vehicle-mounted data obtained in step S1 and their corresponding labels. The generated CART decision tree is then traversed level by level from the root node to the leaf nodes, and the vehicle-mounted data feature used as the split point at each division is extracted; these features are the representative vehicle-mounted data features, and their number is denoted P. The data for the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the feature-extracted vehicle-mounted data;

S3: An xgboost model is constructed from the feature-extracted vehicle-mounted data and the corresponding labels;

S4: During rail transit operation, the values of the P representative vehicle-mounted data features at the current moment are collected and input into the xgboost model to obtain a prediction of the parking distance.

In the rail transit vehicle-mounted data prediction method based on an xgboost model according to the invention, rail transit vehicle-mounted data are first collected. Representative vehicle-mounted data features are then selected from all vehicle-mounted data features of the rail transit by means of a CART decision tree, and the data for these representative features are extracted from the original vehicle-mounted data to form the feature-extracted vehicle-mounted data. An xgboost model is constructed from the feature-extracted vehicle-mounted data and the corresponding labels. During actual rail transit operation, vehicle-mounted data are collected and input into the xgboost model to obtain a prediction of the parking distance. Because the representative features are extracted with a CART decision tree and the xgboost model is built on the feature-extracted data, the method can effectively improve the accuracy of rail transit parking distance prediction.

Drawings

Fig. 1 is a flowchart of an embodiment of a rail transit vehicle-mounted data prediction method based on an xgboost model.

Detailed Description

Specific embodiments of the present invention are described below with reference to the accompanying drawing so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.

Examples

Fig. 1 is a flowchart of an embodiment of a rail transit vehicle-mounted data prediction method based on an xgboost model. As shown in Fig. 1, the rail transit vehicle-mounted data prediction method based on an xgboost model comprises the following specific steps:

S101: Collecting rail transit vehicle-mounted data:

M vehicle-mounted data features of the rail transit are set according to actual needs. During actual rail transit operation, the values of the M vehicle-mounted data features are collected at each of N stops, and the value of the m-th vehicle-mounted data feature at the n-th stop is denoted $f_{nm}$, where $n = 1, 2, \ldots, N$ and $m = 1, 2, \ldots, M$. The M feature values obtained at each stop form one piece of vehicle-mounted data $F_n = \{f_{n1}, f_{n2}, \ldots, f_{nM}\}$. At the same time, the distance $y_n$ between the train door and the platform screen door when the stop is completed is recorded and used as the label corresponding to that piece of vehicle-mounted data.

Each piece of rail transit vehicle-mounted data therefore contains M-dimensional features and describes the state of the train at the corresponding moment; its label, the distance between the train door and the platform screen door, is the outcome associated with that piece of vehicle-mounted data. Table 1 shows an example of the vehicle-mounted data in this embodiment.

TABLE 1

The time column gives the time at which the vehicle-mounted data were collected, which is also the time at which the train stopped.

S102: Extracting representative vehicle-mounted data features:

Because vehicle-mounted data features are extremely rich in practical applications, the value of M may be very large, yet many of these dimensions have little or no influence on the parking distance of the train. To simplify the calculation and increase its speed, the data that influence the parking distance therefore need to be extracted from the massive data. Representative feature extraction means selecting, from a large number of features, those that can represent the characteristics of the data. Because the correlation coefficient method and principal component analysis (PCA) both have limitations, the invention provides a representative vehicle-mounted data feature extraction method based on a CART (Classification and Regression Trees) decision tree, as follows:

A CART decision tree is constructed from the N pieces of rail transit vehicle-mounted data obtained in step S101 and their corresponding labels. The generated CART decision tree is then traversed level by level from the root node to the leaf nodes, and the vehicle-mounted data feature used as the split point at each division is extracted; these features are the representative vehicle-mounted data features, and their number is denoted P. The data for the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the feature-extracted vehicle-mounted data.
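The level-by-level extraction described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the patent's implementation: the nested-dictionary tree layout, the function names `split_features_level_order` and `project`, and the toy values are all assumptions, and the sketch further assumes that a feature used at several split points is counted once.

```python
from collections import deque

def split_features_level_order(root):
    """Return the distinct feature indices used as split points, in level order."""
    selected = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if "feature" not in node:       # leaf node: holds only an output value
            continue
        if node["feature"] not in selected:
            selected.append(node["feature"])
        queue.append(node["left"])
        queue.append(node["right"])
    return selected

def project(records, selected):
    """Keep only the selected feature columns of every record."""
    return [[row[j] for j in selected] for row in records]

# Hypothetical fitted tree over M = 5 features (indices 0..4)
tree = {
    "feature": 3, "threshold": 0.5,
    "left": {"feature": 1, "threshold": 2.0,
             "left": {"value": 0.1}, "right": {"value": 0.2}},
    "right": {"feature": 3, "threshold": 0.9,
              "left": {"value": 0.3}, "right": {"value": 0.4}},
}
records = [[10, 11, 12, 13, 14], [20, 21, 22, 23, 24]]  # N = 2 made-up records

selected = split_features_level_order(tree)  # P = len(selected)
reduced = project(records, selected)         # feature-extracted records
print(selected, reduced)
```

Here features 3 and 1 appear at split points, so P = 2 and each record shrinks from five values to two.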

The CART decision tree is a binary tree built by recursively dividing each region of the input space in which the training set lies into two sub-regions and determining an output value on each sub-region. The construction can be briefly described as follows:

For the original vehicle-mounted data containing M vehicle-mounted data features, every value of every feature is traversed. Each candidate value divides the original N pieces of vehicle-mounted data into two sets, and the mean square error of each set is calculated. The search finds the value that minimizes the sum of the mean square errors of the two sets; the feature to which this value belongs is the optimal split feature of the split point, and the value is the optimal split value. The two resulting sets are then each divided in the same way, by searching for the optimal split feature and optimal split value, until a termination condition is reached.
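The exhaustive search for one split point can be sketched as follows. This is a hedged illustration under stated assumptions: the cost of a set is taken as the sum of squared deviations of its labels from their mean, a common form of the CART regression criterion, and the function names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(labels):
    """Sum of squared deviations of the labels from their mean."""
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)

def best_split(records, labels):
    """Search every (feature, value) pair for the lowest left + right cost.

    Returns (feature_index, split_value, cost), or None if no value
    separates the records into two non-empty sets.
    """
    best = None
    for j in range(len(records[0])):
        for value in sorted({row[j] for row in records}):
            left = [y for row, y in zip(records, labels) if row[j] <= value]
            right = [y for row, y in zip(records, labels) if row[j] > value]
            if not left or not right:   # skip splits that leave one side empty
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, value, cost)
    return best

# Made-up data: feature 0 is noise, feature 1 determines the label
records = [[5.0, 1.0], [1.0, 2.0], [4.0, 3.0], [2.0, 4.0]]
labels = [0.0, 0.0, 1.0, 1.0]

feature, value, cost = best_split(records, labels)
print(feature, value, cost)
```

Building the full tree amounts to applying `best_split` recursively to the left and right sets until a termination condition, such as a maximum depth or a minimum set size, is reached.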

S103: Constructing the xgboost model:

An xgboost model is constructed from the feature-extracted vehicle-mounted data and the corresponding labels.

The xgboost model has been a widely used learning model in recent years. Based on the idea of ensemble learning, it integrates multiple models, with each new model trained on the residuals left by the previous ones, and it performs excellently on most regression and classification problems. The xgboost model is an iterative model containing multiple CART decision trees, in which each subsequent decision tree is generated by fitting the residuals of the preceding decision trees. For the specific principle and construction process of the xgboost model, see the paper "Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016".
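The residual-fitting idea can be illustrated with a deliberately simplified sketch. It is not the xgboost algorithm itself, which additionally uses second-order gradient information and a regularized objective as described in the cited paper; it is plain boosting under squared loss with depth-1 trees on a single feature. All names, the learning rate, the number of trees and the data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the label mean, then fit each new stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up one-feature data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]

model = fit_boosted(xs, ys)
mse_base = mse(ys, [model[0]] * len(ys))                        # constant prediction
mse_boosted = mse(ys, [boosted_predict(model, x) for x in xs])  # after 20 stumps
print(mse_base, mse_boosted)
```

Each stump corrects part of the error left by the trees before it, so the training error of the ensemble falls as trees are added.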

To improve the performance of the constructed xgboost model, the feature-extracted vehicle-mounted data can be divided into a training set and a test set. The training set is first used to construct each decision tree in the xgboost model, and the test set is then used to test each decision tree; decision trees with larger errors can additionally be pruned.

S104: Predicting the parking distance:

During rail transit operation, the values of the P representative vehicle-mounted data features at the current moment are collected and input into the xgboost model to obtain a prediction of the parking distance.

To better illustrate the technical effect of the invention, the invention was verified experimentally on a specific example using a total of 476 test samples. Table 2 compares predicted and actual values of the parking distance for some of the samples in this embodiment.

TABLE 2

As shown in Table 2, the predicted and true values of the parking distance obtained by the method are very close. The average error over the test samples is 0.0000087 mm, which fully meets the requirements of practical application.

Table 3 compares the parking distance prediction of the xgboost model before and after the feature extraction of the present invention.

TABLE 3

As shown in Table 3, extracting the representative vehicle-mounted data features yields, from the historical rail transit vehicle-mounted data, feature data that better reflect the characteristics of the data, so the constructed xgboost model performs better. Compared with a conventional xgboost model constructed directly from the original vehicle-mounted data without feature extraction, the xgboost model obtained by the method performs better on all three performance evaluation indexes: MSE (mean squared error), R-squared (coefficient of determination) and MAE (mean absolute error).
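For reference, the three evaluation indexes can be computed as in the following sketch. The function names and the sample values are illustrative assumptions; the numbers are not taken from Table 3.

```python
def mean_squared_error(y_true, y_pred):
    """MSE: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """MAE: average of absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up true and predicted parking distances
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse_value = mean_squared_error(y_true, y_pred)
mae_value = mean_absolute_error(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print(mse_value, mae_value, r2_value)
```

Lower MSE and MAE and an R-squared closer to 1 indicate a better fit, which is the sense in which one model outperforms another on these indexes.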

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims

1. A rail transit vehicle-mounted data prediction method based on an xgboost model is characterized by comprising the following steps:

S2: A CART decision tree is constructed from the N pieces of rail transit vehicle-mounted data obtained in step S1 and their corresponding labels. The generated CART decision tree is then traversed level by level from the root node to the leaf nodes, and the vehicle-mounted data feature used as the split point at each division is extracted; these features are the representative vehicle-mounted data features, and their number is denoted P. The data for the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the feature-extracted vehicle-mounted data;