CN112988855A

CN112988855A - Subway passenger analysis method and system based on data mining

Info

Publication number: CN112988855A
Application number: CN202110562020.8A
Authority: CN
Inventors: 杨军; 叶谈; 唐英豪; 宫梦婕; 韩啸; 郑颖
Original assignee: China University of Mining and Technology Beijing CUMTB
Current assignee: China University of Mining and Technology Beijing CUMTB
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2021-06-18

Abstract

The invention relates to a subway passenger analysis method and system based on data mining. Source data are obtained from passenger travel transaction records and are processed and analyzed along multiple dimensions, so that subway passenger travel is classified more accurately and effectively, the travel trajectories of passengers are obtained, and a reliable data basis is provided for rail transit planning while the operational safety of the network under heavy passenger flow is maintained. In the technical scheme, the final number of clusters is selected by calculating a deviation coefficient, so that a suitable number of clusters can be chosen quickly from the characteristics of the data and a clustering result that better matches the distribution of passengers is obtained. The travel trajectory of each passenger is analyzed by a spatiotemporal analysis method: the passenger's home address and workplace are estimated and the approximate distance between them is calculated, allowing an in-depth analysis of subway passenger travel and providing a reliable data basis for rail transit planning.

Description

Subway passenger analysis method and system based on data mining

Technical Field

The invention relates to the technical field of information data processing, in particular to a subway passenger analysis method and system based on data mining.

Background

An ultra-large-scale subway network faces heavy passenger flow pressure both in normal operation and in emergencies, and the spatiotemporal travel trajectory of each passenger in the network reflects that passenger's travel choices and activity patterns in a given period. Real-time passenger trajectories provide a detailed data basis for estimating real-time train load factors, monitoring the real-time distribution of passenger flow across the network, optimizing passenger transport organization schemes, formulating flexible fare strategies, and so on. In addition, obtaining at the network level the proportion of transferring passengers on different paths and the path choice behavior of different types of passengers allows the spatiotemporal correlation of passenger flow distribution to be mined more accurately, provides a quantitative basis for station passenger transport organization schemes and for proactive early warning of line passenger flow, and raises the level of intelligent, proactive management and control of the risks of networked subway operation.

With the development of urban rail transit construction in China and the rapid advance of urbanization, how to meet residents' growing travel demand through sound rail transit design has become an urgent problem. Traditional rail travel behavior analysis models and methods, which rely on directly observing pedestrian flow and station throughput, can hardly meet the demand for greater accuracy and finer detail. Meanwhile, the patterns of residents' rail travel reflect changes in urban social space well and provide a valuable reference for sound urban planning.

Disclosure of Invention

In view of the above state of the prior art, the invention aims to provide a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, provide the travel trajectories of passengers, and provide a reliable data basis for rail transit planning while the operational safety of the network under heavy passenger flow is maintained.

In order to achieve the above object, according to one aspect of the present invention, there is provided a subway passenger analysis method based on data mining, comprising the steps of:

acquiring source data of subway passenger travel transactions;

extracting passenger travel characteristic data from the source data as a clustering variable;

carrying out standardized scaling and normalization processing on the clustering variables;

clustering the passenger travel modes by a clustering method and determining the final number of clusters k, to obtain a classification result of the passenger travel modes; the number of clusters k is the value of k that minimizes the deviation coefficient, which is:

[formula not reproduced in the source text]

wherein the two quantities in the formula are respectively the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimension of the data;

and analyzing the travel track of the passenger by using a space-time analysis method.

Further, the passenger travel feature data extracted from the source data include: the average first entry time, the average last exit time, the monthly average number of travel days, the weekly average number of trips, the average duration of a single trip, and the number of stations recorded by card swipes.

Further, normalizing the clustering variables comprises the following steps:

for each clustering variable, applying a standardization transform to obtain the standardized variable [formula not reproduced in the source text];

calculating an entropy value [formula not reproduced in the source text];

calculating the weight of each dimension [formula not reproduced in the source text];

obtaining the normalized data [formula not reproduced in the source text];

wherein the formulas use the data value of the j-th dimension of the i-th passenger, the mean of the j-th dimension of the current data, and the weight of the j-th dimension; W is the vector of the dimension weights, x is the vector of the corresponding data values, L is the data dimension, and n is the total number of data records.

Further, the clustering method for clustering the passenger travel modes comprises the following steps:

selecting K points as clustering center points;

for each data point, according to its distances to the K cluster center points, associating the data point with the nearest cluster center point, and grouping all points associated with the same cluster center point into one cluster:

min{d(i, 1), d(i, 2), d(i, 3), …, d(i, K)}

wherein d(i, c) denotes the Euclidean distance between the feature data of the i-th passenger and the c-th cluster center;

calculating the mean of the coordinates of each cluster, and moving the cluster center point associated with that cluster to the position of the mean:

μ_i = (1/m) × Σ_{c=1}^{m} x_c

wherein μ_i is the center of the i-th cluster, x_c is the c-th data point belonging to the i-th cluster, and m is the number of data points belonging to the i-th cluster;

repeating the above steps until the cluster center points no longer change.

Further, selecting the K points as cluster center points comprises:

substituting the number of clusters and the sample data into a Gaussian mixture model, and obtaining the coordinates of the initial cluster center points after iteration.

Further, the method for analyzing the travel trajectory of the passenger by using the space-time analysis method comprises the following steps:

calculating the address station and the work place station of the passenger according to the passenger travel characteristic data extracted from the source data;

and calculating the distance between the address and the working place according to the address station and the working place station.

Further, the estimating the address and the working place of the passenger according to the passenger travel feature data extracted from the source data includes:

according to the passenger travel feature data, calculating the most probable entry station and exit station of the first trip of each day and the most probable entry station and exit station of the last trip of each day;

if the most probable entry station of the first trip of each day is the same as the most probable exit station of the last trip of each day, that station is the home address station;

if the most probable exit station of the first trip of each day is the same as the most probable entry station of the last trip of each day, that station is the workplace station.

Further, the calculating the distance between the address and the work place according to the address site and the work place site includes:

setting the latitude of the station as alpha, the longitude as beta, taking the center of the earth as an origin, taking the connecting line of the center of the earth and a 0-longitude point on the equator as an X axis, taking the connecting line of the center of the earth and a 90-DEG east longitude point on the equator as a Y axis, and taking the connecting line of the center of the earth and a north pole as a Z axis, establishing a coordinate system, and obtaining the coordinates (X, Y, Z) of the station:

x=R×cosα×cosβ

y=R×cosα×sinβ

z=R×sinα；

wherein R is the radius of the earth, north latitude and east longitude are taken as positive, and south latitude and west longitude are taken as negative;

calculating the straight-line distance L between the home address station and the workplace station:

L=√((x1−x2)²+(y1−y2)²+(z1−z2)²)

wherein the coordinates of the home address station are (x1, y1, z1) and the coordinates of the workplace station are (x2, y2, z2);

converting the straight-line distance L into an arc length C, thereby obtaining the distance between the home address and the workplace:

C=arcsin(L/(2R))×π×R/90, with the arcsine expressed in degrees.

According to another aspect of the invention, a subway passenger analysis system based on data mining is provided, which comprises a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel trajectory analysis module; wherein:

the data acquisition module acquires source data of subway passenger travel transactions and extracts passenger travel feature data from the source data as clustering variables;

the data processing module scales and normalizes the extracted passenger travel feature data;

the passenger travel mode classification module classifies the passenger travel modes by a clustering method and determines the final number of clusters k, the number of clusters k being the value of k that minimizes the deviation coefficient [formula not reproduced in the source text];

the passenger travel trajectory analysis module analyzes the travel trajectory of the passenger by a spatiotemporal analysis method.

In summary, the invention provides a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, provide the travel trajectories of passengers, and provide a reliable data basis for rail transit planning while the operational safety of the network under heavy passenger flow is maintained. The technical scheme of the invention has the following beneficial technical effects:

(1) The final number of clusters is selected by calculating the deviation coefficient. The choice of the number of clusters strongly affects the quality of the final clustering result: too few clusters leave the final cost function large, while too many clusters make the cost function very small but produce so many classes that the practical result is poor. Selecting the number of clusters by the deviation coefficient allows a suitable number to be chosen quickly from the characteristics of the data, giving a clustering result that better matches the distribution of passengers.

(2) The clustering variables are scaled and normalized before data analysis, which makes the data to be processed more uniform and prevents large differences in the value ranges of the clustering variables from degrading calculation accuracy and convergence speed.

(3) Initial cluster center points obtained from the Gaussian mixture model provide suitable starting centers quickly, and suitable initial centers speed up the fitting of the algorithm, so that the clustering result is more effective and reasonable.

(4) The travel trajectory of the passenger is analyzed by a spatiotemporal analysis method, the home address and workplace of the passenger and the approximate distance between them are calculated, the travel of subway passengers is analyzed in greater depth, and a reliable data basis is provided for rail transit planning.

Drawings

FIG. 1 is a flow chart of a subway passenger analysis method based on data mining according to the present invention;

fig. 2 is a block diagram of the subway passenger analysis system based on data mining according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings. According to an embodiment of the invention, a subway passenger analysis method based on data mining is provided, a flow chart of the method is shown in fig. 1, and the method comprises the following steps:

Source data of subway passenger travel transactions are acquired and subjected to simple data cleaning, such as removing abnormal values, null values, extreme values and records that do not conform to the applicable rules. The sources of the travel transaction data include IC card records, two-dimensional code (QR code) ride records and the like.
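The cleaning step above can be sketched as follows. This is only an illustration: the record layout (the dictionary keys), the 240-minute cap on trip duration and the same-station rule are assumptions chosen for the example and are not specified in the source.

```python
def clean_records(records, max_trip_minutes=240):
    """Drop records with null fields or implausible durations.

    records: list of dicts with hypothetical keys card_id, entry_min,
    exit_min, entry_station, exit_station (times in minutes since midnight).
    """
    required = ("card_id", "entry_min", "exit_min", "entry_station", "exit_station")
    cleaned = []
    for rec in records:
        if any(rec.get(key) is None for key in required):
            continue  # null or missing value
        duration = rec["exit_min"] - rec["entry_min"]
        if duration <= 0 or duration > max_trip_minutes:
            continue  # abnormal (non-positive) or extreme duration
        if rec["entry_station"] == rec["exit_station"]:
            continue  # example rule (assumed): entry and exit at the same station
        cleaned.append(rec)
    return cleaned
```

The thresholds would in practice be set from the operator's own fare and operating rules.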

Passenger travel feature data are extracted from the source data as clustering variables, for example by querying the database with SQL. The feature data have the following dimensions: the average first entry time, the average last exit time, the monthly average number of travel days, the weekly average number of trips, the average duration of a single trip, and the number of stations recorded by card swipes. From the feature data, further data can be calculated, such as the most probable entry station and exit station of the first trip and the most probable entry station and exit station of the last trip. The form and representation of the passenger travel feature data are shown in Table 1:

Table 1 Passenger travel feature data form and example
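As an illustration of how such features might be derived from trip records, the following sketch computes them per card in plain Python. The tuple layout of a trip, the function names and the output keys are hypothetical; the source obtains the features by SQL queries and does not give a schema. The travel-day count here covers whatever period the input spans, and times are converted to minutes since midnight.

```python
from collections import defaultdict
from datetime import date

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def extract_features(trips):
    """trips: iterable of (card_id, 'YYYY-MM-DD', entry 'HH:MM', entry_station,
    exit 'HH:MM', exit_station) tuples (assumed layout)."""
    by_day = defaultdict(lambda: defaultdict(list))
    for card, day, t_in, s_in, t_out, s_out in trips:
        by_day[card][day].append((to_minutes(t_in), s_in, to_minutes(t_out), s_out))

    features = {}
    for card, days in by_day.items():
        first_in, last_out, durations = [], [], []
        stations, weeks = set(), set()
        for day, day_trips in days.items():
            day_trips.sort()  # order the day's trips by entry time
            first_in.append(day_trips[0][0])
            last_out.append(day_trips[-1][2])
            weeks.add(tuple(date.fromisoformat(day).isocalendar())[:2])  # (year, ISO week)
            for t_in, s_in, t_out, s_out in day_trips:
                durations.append(t_out - t_in)
                stations.update((s_in, s_out))
        features[card] = {
            "avg_first_entry_min": sum(first_in) / len(first_in),
            "avg_last_exit_min": sum(last_out) / len(last_out),
            "travel_days": len(days),
            "trips_per_week": len(durations) / len(weeks),
            "avg_trip_minutes": sum(durations) / len(durations),
            "station_count": len(stations),
        }
    return features
```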

The clustering variables are then scaled and normalized. Each clustering variable is converted to an integer or floating-point type, and the average first entry time and the average last exit time are converted to minutes since midnight, for example 08:30 becomes 8 × 60 + 30 = 510. The value ranges of the raw clustering variables are too wide, which slows model computation. In addition, the values of the average first entry time and the average last exit time are so large that these two variables would dominate the clustering result, affecting model accuracy and convergence speed. Therefore, the clustering variables can be normalized according to the following steps:

For each clustering variable, where i is the index of the clustering variable in the data (that is, it identifies the data of the i-th passenger) and j is the dimension represented by the clustering variable, a standardization transform is applied to obtain the standardized variable [formula not reproduced in the source text], which uses the value of the j-th dimension of the i-th passenger and the mean of the j-th dimension of the current data;

an entropy value is calculated [formula not reproduced in the source text];

the weight of each dimension is calculated [formula not reproduced in the source text];

the normalized data are obtained [formula not reproduced in the source text];

wherein the formulas use the data value of the j-th dimension of the i-th passenger and the mean of the j-th dimension of the current data; L is the data dimension and n is the total number of data records, that is, the data cover n passengers in total.
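Because the formulas for this step are not reproduced in the text, the following sketch substitutes the commonly described entropy weight method as a stand-in: each column is min-max scaled, an entropy is computed from the column's proportions, and dimensions with lower entropy receive higher weight. The scaling choice, the function names and the handling of constant columns are all assumptions for illustration, not the patent's definitions.

```python
import math

def min_max_columns(rows):
    """Scale each column of rows (n passengers x L dimensions) to [0, 1]."""
    scaled = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        scaled.append([(v - lo) / span for v in col])
    return scaled

def entropy_weights(rows):
    """Return one weight per dimension (assumed entropy weight method, n >= 2)."""
    n = len(rows)
    diversities = []
    for col in min_max_columns(rows):
        total = sum(col)
        if total > 0:
            entropy = -sum((v / total) * math.log(v / total) for v in col if v > 0)
            entropy /= math.log(n)
        else:
            entropy = 1.0  # a constant column carries no information
        diversities.append(1.0 - entropy)
    norm = sum(diversities) or 1.0
    return [d / norm for d in diversities]

def weighted_rows(rows):
    """Multiply each scaled value by the weight of its dimension."""
    weights = entropy_weights(rows)
    cols = min_max_columns(rows)
    return [[w * col[i] for w, col in zip(weights, cols)] for i in range(len(rows))]
```

With this stand-in, a dimension that is identical for every passenger receives zero weight and therefore cannot dominate the clustering.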

The passenger travel modes are clustered by a clustering method, for example by classifying the clustering variables with the k-means algorithm, and the final number of clusters k is determined to obtain the classification result of the passenger travel modes.

The optimization goal of the k-means algorithm is to minimize the sum of the distances between all passenger feature data and the cluster center points to which they belong, that is, the cost function J:

[formula not reproduced in the source text]

wherein J is the cost function; k is the number of clusters (that is, the number of passenger travel mode types); the cluster center coordinates are the mean of the coordinates of the feature data of the passengers in each cluster; m is the total number of sample points (that is, the number of passengers taking part in the clustering); the sample points are the feature data of each passenger; and each sample point is paired with the coordinates of the cluster center to which it belongs. The specific clustering process is as follows:

K points are selected as cluster center points. The K cluster center points can be determined according to the following steps:

The number of clusters K and the sample data are substituted into a Gaussian mixture model [formula not reproduced in the source text], wherein the mixing coefficients give the proportion of each cluster and sum to 1 over all clusters; the mean vector is the center coordinate point of each cluster; and the covariance matrix is an L × L matrix, L being the data dimension;

The parameters of the Gaussian mixture model are initialized and the posterior probability that each sample was generated by each mixture component is calculated, with i = 1, 2, …, K and j = 1, 2, …, L. From these posterior probabilities, a new mean vector, covariance matrix and mixing coefficient are calculated [formulas not reproduced in the source text];

The parameters of the Gaussian mixture model are updated with the new values, and after iteration the initial cluster center coordinates are obtained.
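A minimal sketch of this initialization is given below, using the expectation-maximization procedure for a Gaussian mixture. To keep the example short it assumes diagonal covariance (one variance per dimension) rather than the full L × L covariance matrix described above, and it starts from evenly spaced samples; both choices, and the function name, are assumptions for illustration.

```python
import math

def gmm_means(data, k, iterations=50):
    """Fit a diagonal-covariance Gaussian mixture by EM and return the
    component means, intended as initial cluster centers."""
    n, dims = len(data), len(data[0])
    # Deterministic start: evenly spaced samples as means, global variance.
    means = [list(data[(i * n) // k]) for i in range(k)]
    g_mean = [sum(x[j] for x in data) / n for j in range(dims)]
    g_var = [max(sum((x[j] - g_mean[j]) ** 2 for x in data) / n, 1e-6)
             for j in range(dims)]
    variances = [list(g_var) for _ in range(k)]
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in data:
            logs = []
            for c in range(k):
                log_p = math.log(weights[c])
                for j in range(dims):
                    v = variances[c][j]
                    log_p -= 0.5 * (math.log(2 * math.pi * v)
                                    + (x[j] - means[c][j]) ** 2 / v)
                logs.append(log_p)
            top = max(logs)  # subtract the maximum for numerical stability
            probs = [math.exp(value - top) for value in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: update mixing coefficients, means and variances.
        for c in range(k):
            r_sum = sum(r[c] for r in resp)
            if r_sum < 1e-12:
                continue  # empty component: keep its previous parameters
            weights[c] = r_sum / n
            for j in range(dims):
                means[c][j] = sum(r[c] * x[j] for r, x in zip(resp, data)) / r_sum
                variances[c][j] = max(
                    sum(r[c] * (x[j] - means[c][j]) ** 2
                        for r, x in zip(resp, data)) / r_sum,
                    1e-6)
    return means
```

The returned means can then be passed as starting centers to the clustering step.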

For each data point, according to its distances to the K cluster center points, the data point is associated with the nearest cluster center point, and all points associated with the same cluster center point are grouped into one cluster:

min{d(i, 1), d(i, 2), d(i, 3), …, d(i, K)}

wherein d(i, c) denotes the Euclidean distance between the feature data of the i-th passenger and the c-th cluster center;

The mean of the coordinates of each cluster is calculated, and the cluster center point associated with that cluster is moved to the position of the mean:

μ_i = (1/m) × Σ_{c=1}^{m} x_c

wherein μ_i is the center of the i-th cluster, x_c is the c-th data point belonging to the i-th cluster, and m is the number of data points belonging to the i-th cluster;

The above steps are repeated until the cluster center points no longer change.
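The assignment and update steps above can be sketched in plain Python as follows. The function name, the iteration cap and the handling of empty clusters are assumptions added for the example; the initial centers are supplied by the caller.

```python
import math

def kmeans(data, centers, max_iter=100):
    """Assign each point to its nearest center (Euclidean distance), move each
    center to the mean of its points, and stop when no center changes."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
                  for x in data]
        new_centers = []
        for c, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: center points no longer change
        centers = new_centers
    return centers, labels
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])` groups the first two and the last two points and returns their means as the centers.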

The k value selection influences the quality of the final clustering result. If k is too small, the final cost function J is larger, and if k is too large, although the cost function J is very small, the number of classification is too many, so that the actual effect is not good. In the embodiment, the clustering number k is selected by using the deviation coefficient, so that the proper clustering number can be quickly selected according to the data characteristics, and the clustering result more conforming to the passenger distribution rule is obtained.

The number of clusters k is the value of k that minimizes the deviation coefficient, which is:

[formula not reproduced in the source text]

wherein the two quantities in the formula are respectively the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimension of the data.

The travel trajectory of the passenger is analyzed by a spatiotemporal analysis method, which can be carried out according to the following steps:

The home address station and the workplace station of the passenger are estimated from the passenger travel feature data extracted from the source data:

From the passenger travel feature data, the most probable entry station and exit station of the first trip of each day and the most probable entry station and exit station of the last trip of each day are calculated. If the most probable entry station of the first trip of each day is the same as the most probable exit station of the last trip of each day, that station is the home address station. If the most probable exit station of the first trip of each day is the same as the most probable entry station of the last trip of each day, that station is the workplace station.
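A minimal sketch of this estimation for a single passenger follows. It assumes that the workplace rule compares the most probable exit station of the first trip with the most probable entry station of the last trip, and that "most probable" means most frequent across days; the trip tuple layout and function names are hypothetical.

```python
from collections import Counter, defaultdict

def most_common(values):
    """Return the most frequent value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]

def infer_home_and_work(trips):
    """trips: list of (day, entry_minutes, entry_station, exit_station) for one
    passenger (assumed layout). Returns (home_station, work_station); an entry
    is None when the corresponding pair of stations does not agree."""
    by_day = defaultdict(list)
    for day, t_in, s_in, s_out in trips:
        by_day[day].append((t_in, s_in, s_out))

    first_in, first_out, last_in, last_out = [], [], [], []
    for day_trips in by_day.values():
        day_trips.sort()  # order the day's trips by entry time
        first_in.append(day_trips[0][1])
        first_out.append(day_trips[0][2])
        last_in.append(day_trips[-1][1])
        last_out.append(day_trips[-1][2])

    home = most_common(first_in)
    if home != most_common(last_out):
        home = None
    work = most_common(first_out)
    if work != most_common(last_in):
        work = None
    return home, work
```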

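The distance between the home address and the workplace is calculated from the home address station and the workplace station. With the latitude of a station denoted α and its longitude denoted β, the coordinates (x, y, z) of the station in a coordinate system with its origin at the center of the earth are:

x=R×cosα×cosβ

y=R×cosα×sinβ

z=R×sinα；

wherein R is the radius of the earth. The straight-line distance L between the home address station, with coordinates (x1, y1, z1), and the workplace station, with coordinates (x2, y2, z2), is:

L=√((x1−x2)²+(y1−y2)²+(z1−z2)²)

The straight-line distance L is converted into an arc length C, which gives the distance between the home address and the workplace:

C=arcsin(L/(2R))×π×R/90, with the arcsine expressed in degrees.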
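The distance calculation can be sketched as follows: each station is converted to earth-centered coordinates, the straight-line (chord) distance is taken, and the chord is converted to an arc length with the arcsine in degrees. The radius value of 6371 km and the function names are assumptions; north latitude and east longitude are taken as positive.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, an assumed value for R

def to_xyz(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Earth-centered coordinates of a point given latitude and longitude in degrees."""
    a, b = math.radians(lat_deg), math.radians(lon_deg)
    return (radius * math.cos(a) * math.cos(b),
            radius * math.cos(a) * math.sin(b),
            radius * math.sin(a))

def arc_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Arc length C = arcsin(L / (2R)) [degrees] * pi * R / 90."""
    chord = math.dist(to_xyz(lat1, lon1, radius), to_xyz(lat2, lon2, radius))
    ratio = min(1.0, chord / (2 * radius))  # guard against rounding above 1
    return math.degrees(math.asin(ratio)) * math.pi * radius / 90
```

As a check, two points on the equator 90 degrees of longitude apart give a quarter of the earth's circumference, that is π × R / 2.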

The results of classifying the passenger travel patterns according to the present embodiment can be as shown in Table 2:

Table 2 Passenger travel pattern classification results

Mode 1: late-departure, late-return type

Mode 2: conventional type (features differ little from the mean)

Mode 3: outgoing type (average travel days and number of trips are above the mean)

Mode 4: early-departure, late-return type (few stations recorded, longer duration per trip)

According to the scheme of the embodiment, the personal preference of the passenger can be roughly judged.

From the perspective of information theory, the less frequently an event occurs, the more information it carries.

f(P) denotes the amount of information corresponding to an event with probability P. The lower the probability, the greater the information content.
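The formula for f(P) is not reproduced in the text. One common choice that matches the statement above is Shannon self-information, f(P) = −log2 P; the sketch below uses it as an assumption rather than as the source's definition.

```python
import math

def information_content(p):
    """Shannon self-information, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)
```

Under this choice, a station that accounts for half of a passenger's records carries 1 bit, while one that accounts for an eighth carries 3 bits.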

A passenger's personal preferences can therefore be inferred from the events associated with the low-frequency records in that passenger's travel history. Using spatiotemporal analysis, for example, if a concert is held at time A, passenger X has an exit record during that period at station P near the venue, and the passenger's probability of appearing at that station is lower than the passenger's average probability across stations, it can be inferred that the passenger attended the concert and likes music.

Event → time, space, person → personal preference

The process of determining that a passenger took part in an activity and labeling the corresponding hobby is as follows:

1. Spatial match:

(a) using a web crawler to collect from the Internet the time, the place and the preference category of various activities; the preferences can be preliminarily divided into music, drama, sports, parent-child activities, exhibitions, vocal performances and the like;

(b) collecting the longitudes and latitudes of the subway stations and of each activity venue from the Internet; calculating the straight-line distances from these longitudes and latitudes, and recording the four subway stations nearest to each activity venue.

2. Person and time match:

(a) before the activity begins, the passenger has an exit transaction record at one of the four stations nearest to the activity venue;

(b) after the activity ends, the passenger has an entry transaction record at one of the four stations nearest to the activity venue;

(c) the passenger has no travel transaction record while the activity is in progress.

3. Final screening of passengers who satisfy the spatial, time and person conditions:

determining whether the stations near the activity venue are frequently used stations for the passenger: the frequency of each station in the passenger's travel records is queried, and a station is marked as frequently used if its frequency is greater than the average frequency and as infrequently used otherwise. Passengers for whom the station is frequently used are excluded, and the remaining passengers are labeled with the personal preference.
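The time and person conditions and the final screening can be sketched as follows for one passenger and one activity. The trip tuple layout, the function names and the treatment of times as minutes on the day of the activity are assumptions for illustration; the set of nearby stations is assumed to have been computed beforehand from the station and venue coordinates.

```python
from collections import Counter

def attended_event(trips, event_start, event_end, nearby_stations):
    """trips: list of (entry_min, entry_station, exit_min, exit_station) for one
    passenger on the day of the activity (assumed layout)."""
    arrived = any(t_out <= event_start and s_out in nearby_stations
                  for _, _, t_out, s_out in trips)
    left = any(t_in >= event_end and s_in in nearby_stations
               for t_in, s_in, _, _ in trips)
    travelled_during = any(t_in < event_end and t_out > event_start
                           for t_in, _, t_out, _ in trips)
    return arrived and left and not travelled_during

def is_frequent_station(station, station_history):
    """True if the station appears more often than the passenger's average
    per-station frequency. station_history: stations from past records."""
    counts = Counter(station_history)
    if not counts:
        return False
    average = sum(counts.values()) / len(counts)
    return counts[station] > average

def label_preference(trips, history, event_start, event_end, nearby_stations, preference):
    """Return the preference label, or None if the passenger is screened out."""
    if not attended_event(trips, event_start, event_end, nearby_stations):
        return None
    if any(is_frequent_station(s, history) for s in nearby_stations):
        return None  # a nearby station is a frequently used station: exclude
    return preference
```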

According to another embodiment of the invention, a subway passenger analysis system based on data mining is provided, and the system is configured as shown in fig. 2 and comprises a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel track analysis module.

The data acquisition module is used for acquiring source data of subway passenger travel transactions; extracting passenger travel characteristic data from the source data as a clustering variable;

The passenger travel mode classification module determines the final number of clusters k as the value of k that minimizes the deviation coefficient [formula not reproduced in the source text];

and the passenger travel track analysis module analyzes the travel track of the passenger by using a space-time analysis method.


It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims

1. A subway passenger analysis method based on data mining is characterized by comprising the following steps:

acquiring source data of subway passenger travel transactions;

determining the final number of clusters k, wherein k is the value that minimizes the deviation coefficient [formula not reproduced in the source text].

2. The method of claim 1, wherein the passenger travel feature data extracted from the source data comprises: the average time of the first arrival, the average time of the final departure, the monthly average travel days, the weekly average travel times, the average time length of the single travel and the station number recorded by swiping the card.

3. The method of claim 2, wherein normalizing the cluster variables comprises the steps of:

standardizing each clustering variable; calculating an entropy value; calculating the weight of each dimension; and obtaining the normalized data [formulas not reproduced in the source text];

wherein the formulas use the data value of the j-th dimension of the i-th passenger and the mean of the j-th dimension of the current data; L is the data dimension and n is the total number of data records.

4. The method according to claim 3, wherein the clustering method for the passenger travel patterns comprises the steps of:

selecting K points as clustering center points;

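associating each data point with the nearest of the K cluster center points, and grouping all points associated with the same cluster center point into one cluster:

min{d(i, 1), d(i, 2), d(i, 3), …, d(i, K)}

wherein d(i, c) denotes the Euclidean distance between the feature data of the i-th passenger and the c-th cluster center;

moving each cluster center point to the mean of the coordinates of the data points in its cluster;

repeating the above steps until the cluster center points no longer change.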

5. The method of claim 4, wherein selecting the point as a cluster center point comprises:

substituting the number of clusters and the sample data into a Gaussian mixture model, and obtaining the coordinates of the initial cluster center points after iteration.

6. The method according to claim 5, wherein the travel trajectory of the passenger is analyzed by using a spatiotemporal analysis method, comprising the following steps:

7. The method of claim 6, wherein estimating the address and the work place of the passenger based on the passenger travel feature data extracted from the source data comprises:

8. The method of claim 7, wherein estimating the distance between the address and the work site based on the address site and the work site comprises:

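x=R×cosα×cosβ

y=R×cosα×sinβ

z=R×sinα；

L=√((x1−x2)²+(y1−y2)²+(z1−z2)²)

wherein the coordinates of the home address station are (x1, y1, z1) and the coordinates of the workplace station are (x2, y2, z2);

C=arcsin(L/(2R))×π×R/90, with the arcsine expressed in degrees.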

9. A subway passenger analysis system based on data mining, characterized by comprising a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel trajectory analysis module; wherein the passenger travel mode classification module determines the final number of clusters k as the value of k that minimizes the deviation coefficient [formula not reproduced in the source text].