CN111626508B - Rail transit vehicle-mounted data prediction method based on xgboost model

Info

Publication number: CN111626508B
Application number: CN202010460661.8A
Authority: CN
Inventors: 王晓玲; 李欣
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2023-12-22
Anticipated expiration: 2040-05-27
Also published as: CN111626508A

Abstract

The invention discloses a rail transit vehicle-mounted data prediction method based on an xgboost model, which comprises the steps of firstly collecting rail transit vehicle-mounted data, extracting representative vehicle-mounted data features from all vehicle-mounted data features of the rail transit based on a CART decision tree, extracting data representative of the vehicle-mounted data features from original vehicle-mounted data, taking the data representative of the vehicle-mounted data features as vehicle-mounted data after feature extraction, constructing the xgboost model according to the vehicle-mounted data after feature extraction and corresponding labels thereof, and acquiring the vehicle-mounted data input xgboost model in the actual running process of the rail transit to obtain a prediction result of a parking distance. According to the invention, the characteristic representing the vehicle-mounted data is extracted based on the CART decision tree, and the xgboost model is constructed according to the vehicle-mounted data after the characteristic extraction, so that the accuracy of predicting the rail transit parking distance can be effectively improved.

Description

Rail transit vehicle-mounted data prediction method based on xgboost model

Technical Field

The invention belongs to the technical field of rail transit, and particularly relates to a rail transit vehicle-mounted data prediction method based on an xgboost model.

Background

Rail transit has become an indispensable part of urban life. Hundreds of sensors are distributed on trains and along lines to monitor various data while trains are running, and diagnosing the causes of train faults and errors from these data by manual analysis involves an enormous workload. Analysis of the sensor data also helps adjust train operating parameters in time and provides passengers with a better travel experience. Data analysis is likewise becoming increasingly important to companies, and its primary task is to analyze historical data and make predictions about the future on that basis.

Rail transit vehicle-mounted data is representative of the data formats found in most application fields: the data volume is huge, the features are numerous, and the data types are rich. Because rail transit is an indispensable means of transport for urban travel, analysis of rail transit data is an indispensable part of rail transit operation. However, as times and technology change, traditional manual analysis can no longer cope with the growing data volume and new analysis requirements. With the rapid development of artificial intelligence and machine learning, data-driven business keeps growing, and it has become common industry practice to use machine learning algorithms for data cleaning, feature selection and feature combination, and to construct models for analyzing massive data.

For feature extraction from rail transit data, general methods include principal component analysis (PCA) and the correlation coefficient method. PCA can compress a large number of features while retaining the important ones, but it is only suitable when the variables are strongly correlated, and its feature extraction effect is not ideal for weakly correlated data. In addition, some information may be lost during feature extraction and the meaning of the data may change, so the extracted features are less interpretable than the original data. The correlation coefficient method is sensitive to how the data is organized and requires the data to be linearly correlated; if the relationship is nonlinear, such as a quadratic relationship, the correlation coefficient can be small.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a rail transit vehicle-mounted data prediction method based on an xgboost model, in which representative vehicle-mounted data features are extracted based on a CART decision tree and the xgboost model is constructed from the vehicle-mounted data after feature extraction, so as to improve the accuracy of rail transit parking distance prediction.

In order to achieve the aim of the invention, the rail transit vehicle-mounted data prediction method based on the xgboost model comprises the following steps:

S1: setting M rail transit vehicle-mounted data features according to actual requirements, collecting the values of the M vehicle-mounted data features at N parkings during actual rail transit operation, and denoting the value of the m-th vehicle-mounted data feature at the n-th parking as $f_{nm}$, $n = 1, 2, \dots, N$, $m = 1, 2, \dots, M$; the M vehicle-mounted data features obtained at each parking are assembled into one piece of vehicle-mounted data $F_n = \{f_{n1}, f_{n2}, \dots, f_{nM}\}$; at the same time, the distance $d_n$ between the train door and the platform screen door when the parking is completed is recorded and taken as the label corresponding to the vehicle-mounted data;

S2: constructing a CART decision tree from the N pieces of rail transit vehicle-mounted data and their corresponding labels obtained in step S1; then traversing the generated CART decision tree level by level from the root node to the leaf nodes and extracting the vehicle-mounted data feature used as the split point at each division; these features are the representative vehicle-mounted data features, and their number is denoted P; the data of the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the vehicle-mounted data after feature extraction;

S3: constructing an xgboost model from the vehicle-mounted data after feature extraction and the corresponding labels;

S4: during rail transit operation, collecting the values of the P representative vehicle-mounted data features at the current moment and inputting them into the xgboost model to obtain a prediction result of the parking distance.

In the rail transit vehicle-mounted data prediction method based on the xgboost model, rail transit vehicle-mounted data are first collected; representative vehicle-mounted data features are extracted from all vehicle-mounted data features based on a CART decision tree; the data of the representative features are extracted from the original vehicle-mounted data and serve as the vehicle-mounted data after feature extraction; the xgboost model is built from the vehicle-mounted data after feature extraction and the corresponding labels; and vehicle-mounted data collected during actual rail transit operation are input into the xgboost model to obtain a prediction result of the parking distance. Because the representative features are extracted based on the CART decision tree and the xgboost model is constructed from the vehicle-mounted data after feature extraction, the accuracy of rail transit parking distance prediction can be effectively improved.

Drawings

Fig. 1 is a flowchart of a specific embodiment of the rail transit vehicle-mounted data prediction method based on the xgboost model.

Detailed Description

The embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the description below, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the present invention.

Examples

Fig. 1 is a flowchart of a specific embodiment of the rail transit vehicle-mounted data prediction method based on the xgboost model. As shown in Fig. 1, the method comprises the following specific steps:

S101: collecting rail transit vehicle-mounted data:

Setting M rail transit vehicle-mounted data features according to actual requirements, collecting the values of the M vehicle-mounted data features at N parkings during actual rail transit operation, and denoting the value of the m-th vehicle-mounted data feature at the n-th parking as $f_{nm}$, $n = 1, 2, \dots, N$, $m = 1, 2, \dots, M$. The M vehicle-mounted data features obtained at each parking are assembled into one piece of vehicle-mounted data $F_n = \{f_{n1}, f_{n2}, \dots, f_{nM}\}$. At the same time, the distance $d_n$ between the train door and the platform screen door when the parking is completed is recorded as the label corresponding to the vehicle-mounted data.

Each piece of rail transit vehicle-mounted data therefore comprises M feature dimensions and describes the state of the train at the corresponding moment; the label, the distance between the train door and the platform screen door, is the outcome resulting from that piece of vehicle-mounted data. Table 1 is an example of the vehicle-mounted data in the present embodiment.
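The data layout described in this step can be sketched as follows. This is only an illustration: the feature names and all numeric values are hypothetical, since the actual M features are not listed in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParkingRecord:
    """One piece of vehicle-mounted data F_n together with its label d_n."""
    features: List[float]  # f_n1 ... f_nM, always in the same feature order
    distance: float        # d_n: train door to platform screen door distance


# Hypothetical feature names, used only to fix the column order.
FEATURE_NAMES = ["speed", "brake_level", "traction_current", "load"]

# Hypothetical values for N = 3 parkings.
records = [
    ParkingRecord([1.2, 3.0, 0.5, 41.0], 0.12),
    ParkingRecord([0.9, 2.0, 0.4, 38.5], -0.05),
    ParkingRecord([1.5, 4.0, 0.7, 45.0], 0.20),
]

# Feature matrix (N rows, M columns) and label vector (length N).
X = [r.features for r in records]
d = [r.distance for r in records]
N, M = len(X), len(X[0])
```

Keeping the features in a fixed column order lets the later steps refer to a feature simply by its column index m.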

TABLE 1

The time series indicates the time corresponding to the vehicle-mounted data collection and also indicates the time when the train is stopped.

S102: extracting representative vehicle-mounted data features:

Because vehicle-mounted data features are extremely rich in practical applications, the value of M can be quite large, yet many of these feature dimensions have little or no influence on the parking distance of the train. Therefore, to simplify and speed up the calculation, the data that affect the parking distance must be extracted from the mass of data. Representative feature extraction means extracting, from a large number of features, those that can represent the characteristics of the data. Since the correlation coefficient method and principal component analysis (PCA) both have limitations, the invention provides a representative vehicle-mounted data feature extraction method based on a CART (Classification And Regression Trees) decision tree, with the following specific steps:

A CART decision tree is constructed from the N pieces of rail transit vehicle-mounted data and their corresponding labels obtained in step S101. The generated CART decision tree is then traversed level by level from the root node to the leaf nodes, and the vehicle-mounted data feature used as the split point at each division is extracted. These features are the representative vehicle-mounted data features, and their number is denoted P. The data of the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the vehicle-mounted data after feature extraction.

A CART decision tree is constructed by recursively dividing each region of the input space in which the training set lies into two sub-regions and determining the output value on each sub-region. The method can be briefly described as follows:

For the original vehicle-mounted data containing M vehicle-mounted data features, each value of each vehicle-mounted data feature is traversed; the original N pieces of vehicle-mounted data are divided into two sets by that value, and the mean square error of each set is calculated. The search returns the value that minimizes the sum of the mean square errors of the two sets: the vehicle-mounted data feature corresponding to that value is the optimal split feature for this split point, and the value itself is the optimal split value. Each of the two resulting sets is then divided in the same way, by searching for its optimal split feature and optimal split value, until the termination condition is reached.
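The level-by-level extraction can be sketched as below, assuming the tree has already been fitted. The nested-dict tree layout, the example tree and the data values are hypothetical. Recording each feature only once, even if several nodes split on it, is an assumption; the text does not say how repeated features are handled.

```python
from collections import deque

# Hypothetical fitted regression tree: internal nodes store the split
# feature index and threshold, leaves store only an output value.
tree = {
    "feature": 2, "threshold": 0.5,
    "left": {"feature": 0, "threshold": 1.0,
             "left": {"value": 0.1}, "right": {"value": 0.3}},
    "right": {"feature": 2, "threshold": 0.8,
              "left": {"value": 0.5}, "right": {"value": 0.9}},
}


def representative_features(root):
    """Level-order traversal that collects each split feature once."""
    seen, order = set(), []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if "feature" not in node:  # leaf node: nothing to record
            continue
        if node["feature"] not in seen:
            seen.add(node["feature"])
            order.append(node["feature"])
        queue.append(node["left"])
        queue.append(node["right"])
    return order


def select_columns(X, columns):
    """Keep only the given feature columns of every row."""
    return [[row[c] for c in columns] for row in X]


X = [[1.2, 3.0, 0.5, 41.0], [0.9, 2.0, 0.4, 38.5]]  # N = 2, M = 4
selected = representative_features(tree)            # P = len(selected)
X_reduced = select_columns(X, selected)
```

Here the traversal visits feature 2 at the root and feature 0 on the next level, so P = 2 and every row shrinks from M = 4 values to 2.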
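The split search described above can be sketched as an exhaustive loop over features and candidate values. The cost follows the wording of the text, the plain sum of the mean square errors of the two sets; common CART implementations weight each side by its sample count instead. The function names and the example data are illustrative.

```python
def mse(values):
    """Mean square error of a set of labels around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(X, d):
    """Return (feature_index, split_value, cost) minimizing MSE(left) + MSE(right)."""
    best = None
    for m in range(len(X[0])):                      # every feature
        for value in sorted({row[m] for row in X}):  # every observed value
            left = [d[i] for i, row in enumerate(X) if row[m] <= value]
            right = [d[i] for i, row in enumerate(X) if row[m] > value]
            if not left or not right:                # skip degenerate splits
                continue
            cost = mse(left) + mse(right)
            if best is None or cost < best[2]:
                best = (m, value, cost)
    return best


# Hypothetical data: the label depends only on feature 0.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
d = [0.1, 0.1, 0.9, 0.9]
split = best_split(X, d)
```

Applying `best_split` again to the rows on each side of the returned value, until a stopping rule such as a minimum set size is met, gives the recursive construction.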

S103: constructing an xgboost model:

An xgboost model is constructed from the vehicle-mounted data after feature extraction and the corresponding labels.

The xgboost model is a learning model that has been widely used in recent years. Based on the idea of ensemble learning, it integrates multiple models, makes good use of the training results of the preceding models by further training on their residuals, and performs well on most regression and classification problems. The xgboost model is an iterative model comprising multiple CART decision trees, in which each subsequent decision tree is generated by fitting the residual of the preceding one. For the specific principle and construction process of the xgboost model, see the paper: Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.

To improve the performance of the constructed xgboost model, the vehicle-mounted data after feature extraction can be divided into a training set and a test set. Each decision tree in the xgboost model is first constructed using the training set and then tested with the test set, and decision trees with larger errors can be further pruned.
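The residual-fitting idea can be illustrated with a deliberately simplified stand-in: plain gradient boosting under squared loss with depth-1 trees (stumps). This is not the xgboost algorithm itself, which additionally uses second-order gradient information and a regularization term; all names, parameters and data below are illustrative.

```python
def fit_stump(X, residuals):
    """Depth-1 regression tree: (feature, threshold, left_mean, right_mean)."""
    best = None
    for m in range(len(X[0])):
        for t in sorted({row[m] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[m] <= t]
            right = [r for row, r in zip(X, residuals) if row[m] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, m, t, lm, rm)
    if best is None:  # no valid split: predict the mean everywhere
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    m, t, lm, rm = stump
    return lm if row[m] <= t else rm


def fit_boosted(X, d, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residual left by the previous ones."""
    base = sum(d) / len(d)
    preds = [base] * len(d)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(d, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


def train_mse(model, X, d):
    return sum((predict(model, row) - y) ** 2 for row, y in zip(X, d)) / len(d)


# Hypothetical one-feature data set.
X = [[1.0], [2.0], [3.0], [4.0]]
d = [0.1, 0.2, 0.8, 0.9]
model_1 = fit_boosted(X, d, n_trees=1)
model_20 = fit_boosted(X, d, n_trees=20)
```

Comparing `train_mse(model_1, X, d)` with `train_mse(model_20, X, d)` shows the training error shrinking as more trees are added to correct the remaining residual.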
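A hold-out split of the kind mentioned here can be sketched as follows. The 80/20 ratio, the fixed seed and the data are assumptions, and the pruning of high-error trees is not shown.

```python
import random


def train_test_split(X, d, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(X) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [X[i] for i in idx], [d[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Hypothetical data: 10 samples with one feature each.
X = [[float(i)] for i in range(10)]
d = [0.1 * i for i in range(10)]
(X_train, d_train), (X_test, d_test) = train_test_split(X, d)
```

The model is fitted on `X_train` and `d_train` only; its error on `X_test` and `d_test` then gives an estimate of how it behaves on parkings it has not seen.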

S104: predicting the parking distance:

During rail transit operation, the values of the P representative vehicle-mounted data features at the current moment are collected and input into the xgboost model to obtain a prediction result of the parking distance.

To better illustrate the technical effect of the invention, it was verified experimentally on a specific example, using 476 test samples in total. Table 2 compares the predicted and actual values of the parking distance for part of the samples in the present embodiment.
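At prediction time only the P representative features need to be read from the live record. A minimal sketch, in which `toy_model` is a hypothetical stand-in for the trained xgboost model and the indices and values are made up:

```python
def predict_parking_distance(model, selected, live_record):
    """Pick the P representative feature values and pass them to the model."""
    reduced = [live_record[c] for c in selected]
    return model(reduced)


def toy_model(reduced_row):
    # Hypothetical stand-in for the trained model; not a real fitted predictor.
    return 0.1 * reduced_row[0] - 0.05 * reduced_row[1]


selected = [2, 0]                    # column indices kept after feature extraction
live_record = [1.2, 3.0, 0.5, 41.0]  # all M feature values at the current moment
predicted = predict_parking_distance(toy_model, selected, live_record)
```

The column order in `selected` must match the order used when the model was trained, otherwise the feature values are fed to the wrong inputs.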

TABLE 2

As shown in Table 2, the predicted and true values of the parking distance obtained by the method are very close; the average error over the test samples is 0.0000087 mm, which fully meets the requirements of practical application.

Table 3 compares the parking distance prediction performance of xgboost models built before and after the feature extraction of the invention.

TABLE 3

As shown in Table 3, extracting the data of the representative vehicle-mounted data features from historical rail transit vehicle-mounted data improves the performance of the resulting xgboost model. Compared with a conventional xgboost model built directly from the original vehicle-mounted data without feature extraction, the xgboost model obtained by the invention performs better on all three performance evaluation indexes: MSE (mean squared error), R-Square (coefficient of determination) and MAE (mean absolute error).

While illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments; various changes are protected insofar as they fall within the spirit and scope of the invention as defined by the appended claims.
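The three evaluation indexes follow their standard definitions and can be computed as below. The sample values are made up and are not taken from Table 3.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical true and predicted parking distances.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = {
    "MSE": mean_squared_error(y_true, y_pred),
    "MAE": mean_absolute_error(y_true, y_pred),
    "R-Square": r_squared(y_true, y_pred),
}
```

Lower MSE and MAE and an R-Square closer to 1 indicate a better fit, which is the sense in which the model built after feature extraction is described as performing better.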

Claims

1. The rail transit vehicle-mounted data prediction method based on the xgboost model is characterized by comprising the following steps of:

S2: constructing a CART decision tree from the N pieces of rail transit vehicle-mounted data and their corresponding labels obtained in step S1; then traversing the generated CART decision tree level by level from the root node to the leaf nodes and extracting the vehicle-mounted data feature used as the split point at each division; these features are the representative vehicle-mounted data features, and their number is denoted P; the data of the P representative vehicle-mounted data features are extracted from the original N pieces of vehicle-mounted data, and the resulting N pieces of vehicle-mounted data are the vehicle-mounted data after feature extraction;