CN107563540B

CN107563540B - Method for predicting short-time bus boarding passenger flow based on random forest

Info

Publication number: CN107563540B
Application number: CN201710609933.4A
Authority: CN
Inventors: 王璞; 凌溪蔓
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2017-07-25
Filing date: 2017-07-25
Publication date: 2021-03-30
Anticipated expiration: 2037-07-25
Also published as: CN107563540A

Abstract

The invention provides a method for predicting short-time bus boarding passenger flow based on a random forest, comprising the following steps: acquiring passenger ride information and bus position information in a research area; determining each passenger's boarding stop from the acquired passenger ride information and bus position information; dividing the area into regional bus stops and the day into time windows; training a random forest classifier to establish a regression prediction model; and constructing a prediction sample and inputting it into the regression prediction model to obtain the predicted boarding passenger flow of the target area bus stop in the target time window. By introducing the concept of a regional bus stop and adopting a random forest algorithm, the invention obtains high-precision prediction results and has practical guiding significance.

Description

Method for predicting short-time bus boarding passenger flow based on random forest

Technical Field

The invention relates to the technical field of traffic, in particular to a method for predicting the short-time bus boarding passenger flow based on a random forest.

Background

Public transport plays a leading role in a city's overall transport capacity. At present, however, the public transport capacity of most domestic cities is insufficient, especially during peak hours, which makes the prediction and study of short-term passenger flow at each bus stop very important. Predicting the short-term passenger flow of each bus stop provides a bus operation management system with more reliable passenger flow forecasts, allows bus capacity to be adjusted in time, and relieves crowding among bus passengers. However, current research focuses mainly on optimizing the planning and design of urban public transport networks and on optimizing public transport management, and the following problems exist:

1. Because little data is available to support the above research, it often relies on qualitative analysis.

2. Short-term prediction of traffic flow has been studied widely, but few studies address short-term prediction of bus passenger flow.

3. Existing predictions are based on a single bus stop, and because the passenger flow of a single bus stop fluctuates widely, the prediction performance is poor.

Disclosure of Invention

The invention provides a method for predicting short-time bus boarding passenger flow based on a random forest, which can solve the above problems in the prior art.

The invention provides a method for predicting the short-time bus boarding passenger flow based on a random forest, which comprises the following steps:

step S1: acquiring passenger riding information and bus position information in a research area;

step S2: determining the boarding stop of each passenger according to the passenger ride information and the bus position information obtained in step S1;

step S3: dividing regional bus stops and time windows;

dividing the research area into squares of equal size and numbering the squares, aggregating the bus stops contained in the same square to obtain a regional bus stop, dividing the whole-day research period into time windows of equal length, and counting the boarding passenger flow of each regional bus stop in each time window;

step S4: training a random forest classifier, and establishing a regression prediction model;

determining a target area bus stop and a target time window, taking the boarding passenger flow of the target area bus stop in (d+1) time windows on each of the n days before the prediction date of the target time window as training samples, inputting the training samples into a random forest classifier for training, and establishing a regression prediction model;

wherein the boarding passenger flow in the (d+1) time windows of each day forms one sample, and n and d are integers;

step S5: constructing a prediction sample, inputting the prediction sample into the regression prediction model, and obtaining the predicted boarding passenger flow of the target area bus stop in the target time window;

selecting the boarding passenger flow of the target area bus stop in the d time windows that fall on the same day as, and immediately before, the target time window as the prediction sample, inputting the prediction sample into the regression prediction model, and obtaining the predicted boarding passenger flow of the target area bus stop in the target time window, wherein d is an integer.

In the prior art, the passenger flow at a single bus stop fluctuates widely, so predictions based on a single bus stop perform poorly and offer no practical guidance. The invention introduces the concept of 'regional bus stops' (step S3): the bus stops in a given region are treated as a set, and the total passenger flow of all bus stops in the region is counted and predicted as a whole, which better reflects the travel of residents in the region and is more predictable. The grid size of the regional bus stops can be chosen flexibly according to the actual size of the whole research region and the positions and number of the bus stops it contains. In addition, because ground bus passenger flow is sparse compared with subway passenger flow, the whole day is divided into several equal time windows to obtain better statistics and predictions, and the boarding passenger flow in each time window is counted and predicted in place of the passenger flow at a single point in time, which gives better practical guidance.
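The aggregation into regional bus stops and time windows can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes stop coordinates have already been projected to planar kilometres, it covers a single day, and the function names (`cell_id`, `window_id`, `count_boardings`) and the sample data are hypothetical.

```python
from collections import Counter

CELL_KM = 1.0      # side length of each square (1 km in the embodiment)
WINDOW_HOURS = 1   # length of each time window (1 h in the embodiment)


def cell_id(x_km, y_km, cell_km=CELL_KM):
    """Return the (column, row) index of the square containing a point.

    Assumes planar coordinates in kilometres, not raw longitude/latitude.
    """
    return (int(x_km // cell_km), int(y_km // cell_km))


def window_id(hour, window_hours=WINDOW_HOURS):
    """Return the index of the time window containing an hour of the day."""
    return int(hour // window_hours)


def count_boardings(boardings, stop_xy):
    """Count boardings per (regional bus stop, time window) for one day.

    boardings: iterable of (stop_name, hour_of_day) pairs
    stop_xy:   dict mapping stop_name -> (x_km, y_km)
    """
    counts = Counter()
    for stop, hour in boardings:
        counts[(cell_id(*stop_xy[stop]), window_id(hour))] += 1
    return counts


# Made-up stops and boardings, for illustration only.
stop_xy = {"A": (0.2, 0.3), "B": (0.8, 0.9), "C": (1.5, 0.1)}
boardings = [("A", 18.2), ("B", 18.9), ("C", 18.5), ("A", 19.1)]
counts = count_boardings(boardings, stop_xy)
```

Stops A and B fall in the same square, so their boardings in the 18:00-19:00 window are counted together as one regional bus stop.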

Further, the step S1 specifically includes:

step S1.1: acquiring bus IC card swiping information of passengers in a research area through a bus-mounted card swiping machine, wherein the card swiping information comprises the identity numbers of the passengers, the boarding time and the number of the taken bus;

step S1.2: acquiring trajectory position information for the bus operating period through bus-mounted positioning equipment, the trajectory position information comprising the bus license plate number, the time of each track position point, the longitude of each track position point, and the latitude of each track position point.

Further, the step S2 specifically includes:

step S2.1: comparing the bus position information obtained in the step S1 with actual bus route data, and searching a position point matched with the bus position information from the bus route data, wherein the time information corresponding to the position point is the specific time when the bus arrives at each bus stop;

the bus line data comprises a line number, a station name, a station serial number, a station longitude and a station latitude;

step S2.2: comparing the passenger ride information obtained in step S1 with the calculated specific times at which the bus arrives at each bus stop, and determining the boarding stop of the passenger.

Specifically, the corresponding line number is determined from the bus number; the longitude and latitude of each track position point are then compared with the longitude and latitude of each stop, and when they are close or identical, a one-to-one correspondence is established between the time of the track position point and the stop name, thereby obtaining the specific time at which each bus arrives at each bus stop on its line.

The corresponding bus is determined from the bus number recorded in the passenger ride information, and the passenger's boarding time is then compared with the obtained times at which that bus arrives at each bus stop on its line, thereby obtaining the passenger's boarding stop and boarding time.
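The two matching steps can be sketched as below. The text does not specify how "close" is measured or which arrival a swipe is assigned to, so this sketch assumes a plain distance threshold in degrees (`max_deg`, a hypothetical parameter) and assigns each swipe to the latest stop arrival at or before the swipe time. All function names and sample values are illustrative.

```python
import bisect
import math


def nearest_stop(lon, lat, stops, max_deg=0.0005):
    """Return the name of the stop closest to a GPS point, or None if none is close.

    stops: list of (name, lon, lat). The distance is a plain difference in
    degrees, a rough proxy; max_deg is an assumed closeness threshold.
    """
    best = min(stops, key=lambda s: math.hypot(s[1] - lon, s[2] - lat))
    if math.hypot(best[1] - lon, best[2] - lat) <= max_deg:
        return best[0]
    return None


def arrival_times(track, stops):
    """Turn a time-ordered GPS track into (time, stop) arrivals.

    track: list of (time, lon, lat). Only the first matching point of each
    consecutive visit to a stop is kept.
    """
    arrivals = []
    for t, lon, lat in track:
        stop = nearest_stop(lon, lat, stops)
        if stop is not None and (not arrivals or arrivals[-1][1] != stop):
            arrivals.append((t, stop))
    return arrivals


def boarding_stop(swipe_time, arrivals):
    """Return the stop of the latest arrival at or before the swipe time."""
    times = [t for t, _ in arrivals]
    i = bisect.bisect_right(times, swipe_time) - 1
    return arrivals[i][1] if i >= 0 else None


# Made-up route and track for one bus; times are seconds from an arbitrary origin.
stops = [("S1", 114.000, 22.500), ("S2", 114.010, 22.500)]
track = [(0, 114.000, 22.500), (30, 114.005, 22.500), (60, 114.010, 22.500)]
arrivals = arrival_times(track, stops)
```

A swipe at time 45 is then assigned to S1 and a swipe at time 65 to S2; a swipe before the first recorded arrival returns `None`.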

Further, the step S4 specifically includes:

step S4.1: determining a target area bus stop and a target time window, and acquiring the getting-on passenger flow of (d +1) time windows of the target area bus stop in n days before the prediction date of the target time window, namely the total getting-on passenger flow of (d +1) time windows of all bus stops in a grid area corresponding to the target area bus stop in n days before the prediction date of the target time window, wherein n and d are integers;

step S4.2: taking the boarding passenger flow of the target area bus stop in the d time windows before the target time window on each of the n days as a first input parameter x of a training sample, taking the boarding passenger flow of the target area bus stop in the target time window on each of the n days as a second input parameter y of the training sample, and constructing a training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$; each sample $(x_i, y_i)$ contains both input parameters, i.e., the boarding passenger flow of (d+1) time windows of one day, where $x_i$ has feature dimension d, i.e., $x_i$ in each sample has d values representing the boarding passenger flow in the d time windows preceding the period of that day that coincides with the target time window; the $x_i$ form the matrix $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T$, which has n rows and d columns and represents the boarding passenger flow in the d time windows preceding the period coinciding with the target time window on the n days; $y_i$ has feature dimension 1, i.e., $y_i$ in each sample has one value representing the boarding passenger flow in the period of that day that coincides with the target time window; the $y_i$ form the column matrix $Y = (y_1, y_2, \ldots, y_n)^T$, which has n rows and 1 column and represents the known boarding passenger flow in the period coinciding with the target time window on the n days; $y_i$ and $x_i$ are in one-to-one correspondence;

step S4.3: using the training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, to train a random forest classifier and establish a regression prediction model.
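The construction of X, Y and of the prediction sample can be sketched as follows. The layout of the per-day counts (a dict from window index to boarding count) and the function names are assumptions made for illustration, and the numbers are made up.

```python
def build_training_set(daily_counts, target_window, d):
    """Build the n-by-d matrix X and the length-n column Y.

    daily_counts: list of n dicts, one per historical day, each mapping a
                  time-window index to the boarding count of the target
                  regional bus stop in that window.
    """
    X, Y = [], []
    for day in daily_counts:
        # d windows immediately before the target window form x_i
        X.append([day[w] for w in range(target_window - d, target_window)])
        # the target window itself forms y_i
        Y.append(day[target_window])
    return X, Y


def build_prediction_sample(today_counts, target_window, d):
    """Build the prediction sample from the d windows before the target window."""
    return [today_counts[w] for w in range(target_window - d, target_window)]


# Made-up counts for n = 3 historical days and the prediction day.
history = [
    {16: 120, 17: 150, 18: 180, 19: 140},
    {16: 110, 17: 160, 18: 170, 19: 135},
    {16: 130, 17: 140, 18: 190, 19: 150},
]
today = {16: 125, 17: 155, 18: 175}

X, Y = build_training_set(history, target_window=19, d=3)
x_star = build_prediction_sample(today, target_window=19, d=3)
```

Each row of X and the matching entry of Y come from the same day, which is the one-to-one correspondence between $x_i$ and $y_i$.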

Further, step S4.3 specifically includes:

taking the training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, the number t of CART classification regression trees, and the depth deep of each tree as input parameters of the random forest algorithm, and performing model training with f-dimensional features used at each node;

wherein t, deep and f are integers, and f takes the value d, the square root of d, or the base-2 logarithm of d;

step S4.3.1: sampling with replacement from the training sample set D to form the bootstrap sample set of the j-th of the t CART classification regression trees, and, in each tree, using the nodes in sequence from the root node to divide the bootstrap sample set corresponding to that tree;

wherein j has a value ranging from 1 to t;

step S4.3.2: at each node of each tree, randomly selecting f-dimensional features without replacement from the d-dimensional features of the training sample set, finding among the f-dimensional features the k-th dimension feature with the best classification effect, and dividing each current node that does not meet the termination condition by taking the k-th dimension feature as the dividing feature and the feature value corresponding to the k-th dimension feature as the threshold;

wherein samples in the current node whose k-th dimension feature is smaller than the threshold are divided into the left node, the remaining samples in the current node are divided into the right node, and k ranges from 1 to f;

step S4.3.3: marking a current node that meets the termination condition as a leaf node, wherein the predicted output of the leaf node is the average of all sample values contained in the current node;

wherein the termination condition is that the current node stops splitting when the number of samples it contains is at the minimum and the information gain is at the minimum;

step S4.3.4: repeating steps S4.3.1 through S4.3.3 until all nodes are trained or marked as leaf nodes;

step S4.3.5: steps S4.3.1 through S4.3.4 are repeated until all CART classification regression trees are trained.
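Steps S4.3.1 to S4.3.5 can be sketched in plain Python as below. This is an illustrative sketch, not the patented implementation. Two assumptions are made: the split score uses the sum of squared deviations (a size-weighted variant of a variance difference, swapped in so the gain is never negative), and the termination thresholds `min_samples` and `min_gain` are hypothetical parameters, since the text gives no concrete values.

```python
import random


def sq_dev(ys):
    """Sum of squared deviations from the mean (variance times sample count)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, Y, features):
    """Return (gain, k, threshold) for the best split, or None if none exists."""
    parent = sq_dev(Y)
    best = None
    for k in features:
        for threshold in sorted({x[k] for x in X}):
            left = [y for x, y in zip(X, Y) if x[k] < threshold]
            right = [y for x, y in zip(X, Y) if x[k] >= threshold]
            if not left or not right:
                continue
            gain = parent - sq_dev(left) - sq_dev(right)
            if best is None or gain > best[0]:
                best = (gain, k, threshold)
    return best


def build_tree(X, Y, f, depth, rng, min_samples=2, min_gain=1e-9):
    """Grow one regression tree; a leaf stores the mean of its sample values."""
    leaf = {"value": sum(Y) / len(Y)}
    if depth == 0 or len(Y) < min_samples:
        return leaf
    # f of the d features, drawn without replacement at this node
    features = rng.sample(range(len(X[0])), f)
    split = best_split(X, Y, features)
    if split is None or split[0] < min_gain:
        return leaf
    _, k, threshold = split
    left = [(x, y) for x, y in zip(X, Y) if x[k] < threshold]
    right = [(x, y) for x, y in zip(X, Y) if x[k] >= threshold]
    return {
        "k": k,
        "threshold": threshold,
        "left": build_tree([x for x, _ in left], [y for _, y in left],
                           f, depth - 1, rng, min_samples, min_gain),
        "right": build_tree([x for x, _ in right], [y for _, y in right],
                            f, depth - 1, rng, min_samples, min_gain),
    }


def train_forest(X, Y, t, deep, f, seed=0):
    """Train t trees, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(t):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [Y[i] for i in idx],
                                 f, deep, rng))
    return forest


# Made-up training data with n = 3 days and d = 3 time windows.
X = [[120, 150, 180], [110, 160, 170], [130, 140, 190]]
Y = [140, 135, 150]
forest = train_forest(X, Y, t=10, deep=3, f=3)
```

Each tree is a nested dict: internal nodes hold the dividing feature `k` and its `threshold`, and leaves hold the mean boarding flow of the samples that reached them.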

Further, the prediction process of the regression prediction model in step S5 specifically includes:

selecting a jth CART classification regression tree, dividing prediction samples from a root node of a current tree through the CART classification regression tree, dividing the prediction samples smaller than a threshold value to a left node according to the division characteristics and the threshold value of the nodes, dividing the rest samples to a right node until the leaf nodes of the current tree are reached, and outputting a predicted value; wherein j has a value ranging from 1 to t;

and repeating the operation until all the CART classification regression trees output predicted values, wherein the average value of the output predicted values of all the CART classification regression trees is the output value of the regression prediction model.
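The prediction pass can be sketched as follows. The tree representation (nested dicts with `k`, `threshold`, `left`, `right`, and `value` keys) is an assumption chosen for illustration, and the two hand-built trees use made-up thresholds and leaf values.

```python
def predict_tree(tree, x):
    """Walk from the root to a leaf and return the value stored in the leaf."""
    node = tree
    while "value" not in node:
        # samples below the threshold go left, all others go right
        node = node["left"] if x[node["k"]] < node["threshold"] else node["right"]
    return node["value"]


def predict_forest(forest, x):
    """Return the average of the predictions of all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


# Two hand-built trees, for illustration only.
tree_a = {"k": 2, "threshold": 150,
          "left": {"value": 100.0}, "right": {"value": 140.0}}
tree_b = {"k": 0, "threshold": 100,
          "left": {"value": 90.0}, "right": {"value": 130.0}}

x_star = [120, 150, 180]
prediction = predict_forest([tree_a, tree_b], x_star)
```

Here the first tree outputs 140.0 and the second 130.0, so the forest output is their mean.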

Advantageous effects

The invention provides a method for predicting the passenger flow of a short-time bus on the basis of a random forest aiming at the problem that passengers waiting for the bus cannot get on the bus due to excessive crowding in the bus at a peak time, namely the urban bus capacity at the peak time is insufficient.

Drawings

FIG. 1 is a flow chart of a method for predicting the passenger flow volume of a short-time bus based on a random forest according to an embodiment of the present invention;

FIG. 2 and FIG. 3 are result graphs for predicting short-time bus boarding passenger flow in Shenzhen by applying the method, wherein FIG. 2 shows the fit between the observed values for 18:00-19:00 and the observed values for 19:00-20:00 on October 30, 2014, and FIG. 3 shows the fit between the predicted values and the observed values for 19:00-20:00 in all regions on October 30, 2014.

Detailed Description

In order to better understand the method for predicting the short-time bus boarding passenger flow based on the random forest, which is provided by the invention, the following detailed explanation is carried out by combining a specific embodiment.

The embodiment of the invention provides a method for predicting short-time bus boarding passenger flow based on a random forest, whose specific steps are shown in FIG. 1. The embodiment uses Shenzhen bus IC card swipe data and bus GPS data from October 11, 2014 to October 31, 2014. The specific implementation comprises the following steps:

Step S1: acquiring bus IC card swipe data for Shenzhen from October 11, 2014 to October 31, 2014, wherein each swipe record contains information such as 'passenger ID', 'boarding time' and 'bus ID'; and acquiring bus-mounted GPS data for Shenzhen from October 11, 2014 to October 31, 2014, wherein the data contains information such as 'bus ID', 'GPS point time', 'GPS point longitude' and 'GPS point latitude'.

Step S2: calculating the boarding location and boarding time of each passenger from the above information, i.e., the bus stop at which each of the passenger's card swipe records occurred. To this end, the GPS data is first fused with the bus route and stop data to calculate the specific time at which each bus arrives at each bus stop; the fused data is then fused with the passenger IC card swipe data to calculate each passenger's boarding stop and boarding time. The bus route data includes "route number", "station name", "station serial number", "station longitude", and "station latitude". The data fusion retains all fields of the different data sets. The result contains the boarding stop information of Shenzhen passengers from October 11, 2014 to October 31, 2014, and includes 14109 travel records of 293 passengers after screening.

Step S3: dividing the whole research region, namely Shenzhen city (53914 bus stops in total), into squares of equal size (each 1 km × 1 km), numbering the squares, and then aggregating the bus stops contained in each square to form the regional bus stops; meanwhile, this embodiment uses the card swipe data generated between 6:00 and 22:00 each day to count the boarding passenger flow in each square in each time window on each day, where the time window length is 1 h, giving 16 time windows per day.

Step S4: determining a target area bus stop (namely a square grid to be predicted) and a target time window (namely a time window to be predicted), and counting to obtain the passenger flow of the target area bus stop in (d +1) time windows in n days before the prediction date of the target time window.

Taking the boarding passenger flow of the target area bus stop in the d time windows before the target time window on each of the n days as a first input parameter x of a training sample, taking the boarding passenger flow of the target area bus stop in the target time window on each of the n days as a second input parameter y of the training sample, and constructing a training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$; each sample $(x_i, y_i)$ contains both input parameters, i.e., the boarding passenger flow of (d+1) time windows of one day, where $x_i$ has feature dimension d, i.e., $x_i$ in each sample has d values representing the boarding passenger flow in the d time windows preceding the period of that day that coincides with the target time window; the $x_i$ form the matrix $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T$, which has n rows and d columns and represents the boarding passenger flow in the d time windows preceding the period coinciding with the target time window on the n days; $y_i$ has feature dimension 1, i.e., $y_i$ in each sample has one value representing the boarding passenger flow in the period of that day that coincides with the target time window; the $y_i$ form the column matrix $Y = (y_1, y_2, \ldots, y_n)^T$, which has n rows and 1 column and represents the known boarding passenger flow in the period coinciding with the target time window on the n days; $y_i$ and $x_i$ are in one-to-one correspondence.

The training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, is used to train a random forest classifier and establish a regression prediction model.

Step S5: acquiring the boarding passenger flow of the target area bus stop in the d time windows before the target time window on the same day; constructing a prediction sample $x^*$, $x^* \in \mathbb{R}^d$, from this boarding passenger flow, where the feature dimension d of the prediction sample $x^*$ is the same as that of each sample in the first input parameter x, i.e., the prediction sample and the first input parameter contain the boarding passenger flow of the same number of time windows; and inputting the prediction sample $x^*$ into the regression prediction model to obtain the predicted boarding passenger flow of the target area bus stop in the target time window.

Specifically, in this embodiment, the boarding passenger flow of the target grid square in the period 16:00-20:00 (i.e., time windows TW16 to TW19) on October 27 to October 30, 2014 is selected for research and analysis, with the data from the 27th to the 29th used as the training set and the data from the 30th used as the prediction set. In the corresponding training feature set D, d = 3, covering the 3 time windows TW16 to TW18; n = 3, representing the three days from the 27th to the 29th; the prediction set $x^*$ represents the boarding passenger flow of the target area bus stop during 16:00-19:00 (TW16 to TW18) on the 30th. The machine learning algorithm used in this embodiment is a random forest, and its implementation comprises two processes, training and prediction, as follows:

training process:

1. The training set is $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, and the test set is the prediction set $x^* \in \mathbb{R}^d$; the training feature set and the prediction set both have d-dimensional features. t CART classification regression trees are formed, each of depth deep, with f-dimensional features used at each node; a node stops splitting when the number of samples it contains is at the minimum and the information gain is at the minimum. In this embodiment, t is 10, deep is kept at its initial value, and f is d.

2. train(j) is formed by sampling with replacement from the training feature set D, where train(j) denotes the training set of the j-th CART classification regression tree, and training starts from the root node, where j = 1, 2, 3, …, 10.

3. If the current node meets the termination condition, it is marked as a leaf node whose prediction output is the average of all sample values in the sample set contained in the node, and training then continues with the other nodes. If the current node does not meet the termination condition, f-dimensional features are randomly selected from the d-dimensional features in a certain proportion without replacement (f is generally d, sqrt(d) or log₂(d); in this embodiment f = d). Among them, the one-dimensional feature with the best classification effect is found, i.e., the one that maximizes the value obtained by subtracting the variance VarLeft of the left child node and the variance VarRight of the right child node from the variance Var of the current node's sample set. This feature is recorded as the k-th dimension feature (1 ≤ k ≤ f), and its corresponding feature value is the threshold Threshold. Samples of the current node whose k-th dimension feature is smaller than Threshold are divided into the left node and the remaining samples into the right node, and training then continues with the other nodes.

4. Steps 2 and 3 are repeated until all nodes are trained or marked as leaf nodes.

5. Steps 2, 3 and 4 are repeated until all CART classification regression trees are trained.
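The split criterion in step 3 can be written compactly. The following is a sketch under stated assumptions: Var(·) is taken to be the variance of the boarding passenger flow values y in a node, and the symbols $S$, $S_L$, $S_R$ and $\theta$ are introduced here for illustration and do not appear in the original text. $S$ denotes the samples in the current node, k ranges over the f selected features, and $\theta$ ranges over the candidate feature values.

```latex
\[
(k, \mathrm{Threshold}) = \arg\max_{k,\,\theta}
  \bigl[\operatorname{Var}(S) - \operatorname{Var}(S_L(k,\theta)) - \operatorname{Var}(S_R(k,\theta))\bigr],
\]
\[
S_L(k,\theta) = \{(x_i, y_i) \in S : x_{i,k} < \theta\}, \qquad
S_R(k,\theta) = S \setminus S_L(k,\theta).
\]
```

Here $\operatorname{Var}(S)$, $\operatorname{Var}(S_L)$ and $\operatorname{Var}(S_R)$ correspond to Var, VarLeft and VarRight in step 3.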

Prediction process:

1. For the j-th CART tree, starting from the root node of the current tree, samples smaller than the threshold Threshold are divided to the left node according to the dividing feature k and the threshold Threshold of the current node, and the remaining samples are divided to the right node, until a leaf node is reached and a predicted value is output.

2. The previous step is repeated until all t CART trees have output predicted values; the predicted value of the random forest model is the average of the outputs of all CART trees.

Specifically, in this embodiment, the boarding passenger flow of all squares for 19:00-20:00 on October 30 is predicted and the predicted values are compared with the observed values; the results are shown in FIG. 2 and FIG. 3. FIG. 2 shows the fit between the observed values for 18:00-19:00 and the observed values for 19:00-20:00 on October 30, 2014; FIG. 3 shows the fit between the predicted values and the observed values for 19:00-20:00 in all regions on October 30, 2014. R² in the figures is the coefficient of determination (goodness of fit): the larger it is, the more densely the points cluster around the regression line and the better the prediction; RMSE is the root mean square error, and a smaller value indicates better prediction. As can be seen, the prediction effect of the method is excellent.
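The two evaluation measures can be computed as sketched below. The text does not give the exact formulas used for the figures, so this sketch assumes the common definitions: RMSE as the square root of the mean squared error, and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The sample values are made up and are not results from the embodiment.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up boarding flows for three regional bus stops, for illustration only.
observed = [100, 200, 300]
predicted = [110, 190, 300]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

A perfect prediction gives an RMSE of 0 and an R² of 1 under these definitions.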

In summary, the method for predicting the passenger flow volume of the short-time bus based on the random forest, provided by the invention, aims at the problem that passengers waiting for the bus cannot get on the bus due to excessive congestion in the bus at the peak time, namely the problem that the urban bus transport capacity is insufficient at the peak time, predicts the passenger flow volume of the regional bus stop by providing the concept of the regional bus stop and mining and learning training historical data by means of a machine learning algorithm, provides a more reliable predicted passenger flow volume for a bus operation management system, plays a role in timely adjusting the bus transport capacity, and improves the service level of the public transport.

The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for predicting the short-time bus boarding passenger flow based on a random forest is characterized by comprising the following steps:

step S3: dividing a regional bus stop and a time window;

selecting the boarding passenger flow of the target area bus stop in d time windows which are positioned on the same day and before the target time window as a prediction sample, inputting the prediction sample into the regression prediction model to obtain the predicted boarding passenger flow of the target area bus stop in the target time window, wherein d is an integer;

the step S4 specifically includes:

step S4.1: determining a target area bus stop and a target time window, and acquiring the passenger flow volume of the target area bus stop in (d +1) time windows in n days before the prediction date of the target time window, wherein n and d are integers;

step S4.2: taking the boarding passenger flow of the target area bus stop in the d time windows before the target time window on each of the n days as a first input parameter x of a training sample, taking the boarding passenger flow of the target area bus stop in the target time window on each of the n days as a second input parameter y of the training sample, and constructing a training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, wherein each sample $(x_i, y_i)$ contains both input parameters, $x_i$ has feature dimension d, and d is an integer;

2. The prediction method according to claim 1, wherein the step S1 specifically includes:

3. The prediction method according to claim 2, wherein the step S2 specifically includes:

4. The prediction method according to claim 3, characterized in that said step S4.3 is specifically:

wherein j has a value ranging from 1 to t;

5. The prediction method according to claim 4, wherein the prediction process of the regression prediction model in step S5 is specifically: