CN112988855A - Subway passenger analysis method and system based on data mining - Google Patents

Subway passenger analysis method and system based on data mining

Info

Publication number
CN112988855A
Authority
CN
China
Prior art keywords
passenger
data
travel
clustering
station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110562020.8A
Other languages
Chinese (zh)
Inventor
杨军
叶谈
唐英豪
宫梦婕
韩啸
郑颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology Beijing CUMTB
Original Assignee
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology Beijing CUMTB
Priority to CN202110562020.8A
Publication of CN112988855A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Human Resources & Organizations (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a subway passenger analysis method and system based on data mining. Source data are obtained from passenger travel transaction records and are processed and analyzed along multiple dimensions, so that subway passenger travel is classified more accurately and effectively, the travel trajectories of passengers are obtained, and a reliable data-driven basis is provided for rail transit planning while the safe operation of the network under heavy passenger flow is ensured. In this technical scheme, the final number of clusters is selected by calculating a deviation coefficient, so that a suitable number of clusters can be chosen quickly according to the characteristics of the data and a clustering result that better matches the distribution of passengers is obtained. The travel trajectory of each passenger is analyzed by a spatio-temporal analysis method: the passenger's home address and workplace are estimated and the approximate distance between them is calculated, giving a deeper analysis of how subway passengers travel.

Description

Subway passenger analysis method and system based on data mining
Technical Field
The invention relates to the technical field of information data processing, in particular to a subway passenger analysis method and system based on data mining.
Background
A very large subway network operates under heavy passenger-flow pressure, both in normal conditions and in emergencies, and the spatio-temporal travel trajectory of each passenger in the network reflects that passenger's travel choices and activity patterns in a given period. Real-time passenger travel trajectories can provide a detailed data basis for estimating the real-time load factor of trains, monitoring the real-time distribution of passenger flow across the network, optimizing passenger transport organization schemes, formulating flexible fare strategies, and so on. In addition, obtaining at the network level the proportions of passengers transferring among different paths and the path-choice behavior of different types of passengers makes it possible to mine the spatio-temporal correlation of passenger flow distribution more accurately, provides a quantitative basis for formulating station passenger transport organization schemes and for proactive early warning of line passenger flow, and raises the level of intelligence in the active management and control of networked subway operation risks.
With the development of urban rail transit construction in China and the rapid advance of urbanization, meeting residents' growing travel demand through sound rail transit design has become an urgent problem. Traditional models and methods for analyzing rail travel behavior, which rely on directly observing pedestrian flow and station throughput, can hardly meet the demand for greater accuracy and finer granularity. At the same time, the patterns of residents' rail travel reflect changes in urban social space well and provide a valuable reference for sound urban planning.
Disclosure of Invention
In view of the above state of the prior art, the invention aims to provide a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, provide the travel trajectories of passengers, and provide a reliable data-driven basis for rail transit planning while ensuring the safe operation of the network under heavy passenger flow.
In order to achieve the above object, according to one aspect of the present invention, there is provided a subway passenger analysis method based on data mining, comprising the steps of:
acquiring source data of subway passenger travel transactions;
extracting passenger travel characteristic data from the source data as a clustering variable;
carrying out standardized scaling and normalization processing on the clustering variables;
clustering the passenger travel modes by a clustering method and determining the final number of clusters k to obtain a classification result of the passenger travel modes; the number of clusters k is chosen as the k that minimizes the deviation coefficient, which is defined as:
[Equation image: definition of the deviation coefficient]
where the two quantities appearing in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data;
and analyzing the travel track of the passenger by using a space-time analysis method.
Further, the passenger travel feature data extracted from the source data include: the average time of the first station entry of the day, the average time of the last station exit of the day, the average number of travel days per month, the average number of trips per week, the average duration of a single trip, and the number of stations recorded by card swipes.
Further, the normalization processing of the clustering variables comprises the following steps:
for each clustering variable, performing a standardization conversion, the standardized variable being:
[Equation image: standardization formula]
calculating the entropy value of each dimension:
[Equation image: entropy formula]
calculating the weight of each dimension:
[Equation image: weight formula]
the normalized data being:
[Equation image: normalized data]
where the formulas use the data value of the j-th dimension of the i-th passenger, the mean of the j-th dimension of the current data, and the weight of the j-th dimension; W is the vector of the dimension weights, x is the vector of the data values, L is the data dimensionality, and n is the total number of data records.
Further, clustering the passenger travel modes by the clustering method comprises the following steps:
selecting K points as cluster center points;
for each data point $x_i$, computing its distance to each of the K cluster center points, associating it with the nearest cluster center point, and grouping all points associated with the same cluster center point into one class:
$$\min\{\, d(x_i, \mu_1),\ d(x_i, \mu_2),\ d(x_i, \mu_3),\ \dots,\ d(x_i, \mu_K) \,\}$$
where $d(x_i, \mu_1)$ denotes the Euclidean distance between the feature data point $x_i$ and the class center $\mu_1$;
calculating the coordinate mean of each cluster and moving the cluster center point associated with that cluster to the mean:
$$\mu_i = \frac{1}{m}\sum_{c=1}^{m} x_c^{(i)}$$
where $x_c^{(i)}$ is the c-th data point belonging to the i-th cluster and m is the number of data points belonging to the i-th cluster;
repeating the above steps until the class center points no longer change.
Further, selecting the points used as cluster center points comprises:
substituting the number of clusters and the sample data into a Gaussian mixture model and, after iteration, obtaining the coordinates of the initial cluster center points.
Further, analyzing the travel trajectory of the passenger by the spatio-temporal analysis method comprises the following steps:
estimating the passenger's home station and workplace station from the passenger travel feature data extracted from the source data;
and calculating the distance between the home address and the workplace from the home station and the workplace station.
Further, estimating the passenger's home station and workplace station from the passenger travel feature data extracted from the source data comprises:
calculating, from the passenger travel feature data, the most probable entry station of the first trip of the day, the most probable exit station of the last trip of the day, and the most probable exit station of the first trip of the day;
if the most probable entry station of the first trip of the day is the same as the most probable exit station of the last trip of the day, that station is the home station;
the most probable exit station of the first trip of the day is the workplace station.
Further, calculating the distance between the home address and the workplace from the home station and the workplace station comprises:
letting the latitude of a station be α and its longitude be β, and establishing a coordinate system with the center of the earth as the origin, the line from the center of the earth to the point on the equator at longitude 0° as the X axis, the line from the center of the earth to the point on the equator at longitude 90° east as the Y axis, and the line from the center of the earth to the north pole as the Z axis, the coordinates (x, y, z) of the station being:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
where R is the radius of the earth, north latitude and east longitude are taken as positive, and south latitude and west longitude are taken as negative;
calculating the straight-line distance L between the home station and the workplace station:
$$L = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
where the coordinates of the home station are $(x_1, y_1, z_1)$ and the coordinates of the workplace station are $(x_2, y_2, z_2)$;
converting the straight-line distance L into an arc length C, thereby obtaining the distance between the home address and the workplace:
C=arcsin(L/(2R))×π×R/90, with arcsin expressed in degrees.
According to another aspect of the invention, a subway passenger analysis system based on data mining is provided, comprising a data acquisition module, a data processing module, a passenger travel mode classification module, and a passenger travel trajectory analysis module, wherein:
the data acquisition module acquires source data of subway passenger travel transactions and extracts passenger travel feature data from the source data as clustering variables;
the data processing module performs standardized scaling and normalization on the extracted passenger travel feature data;
the passenger travel mode classification module classifies passenger travel modes by a clustering method and determines the final number of clusters k, the number of clusters k being the k that minimizes the deviation coefficient, which is defined as:
[Equation image: definition of the deviation coefficient]
where the two quantities appearing in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data;
the passenger travel trajectory analysis module analyzes the travel trajectory of the passenger by a spatio-temporal analysis method.
In summary, the invention provides a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, provide the travel trajectories of passengers, and provide a reliable data-driven basis for rail transit planning while ensuring the safe operation of the network under heavy passenger flow. The technical scheme of the invention has the following beneficial technical effects:
(1) The final number of clusters is selected by calculating the deviation coefficient. The choice of the number of clusters strongly affects the quality of the final clustering result: if the number is too small, the final cost function is large; if it is too large, the cost function is very small but there are too many classes and the practical effect is poor. Selecting the number of clusters by the deviation coefficient makes it possible to choose a suitable number quickly according to the characteristics of the data and to obtain a clustering result that better matches the distribution of passengers.
(2) The clustering variables are scaled and normalized before data analysis, which makes the data to be processed more uniform and prevents large differences between the feature ranges of the clustering variables from degrading calculation accuracy and convergence speed.
(3) Obtaining the initial cluster center points from the Gaussian mixture model yields suitable initial cluster centers quickly; suitable initial centers speed up convergence of the algorithm and make the clustering result more effective and reasonable.
(4) The travel trajectory of the passenger is analyzed by a spatio-temporal analysis method; the passenger's home address and workplace and the approximate distance between them are estimated, the travel of subway passengers is analyzed in greater depth, and a reliable data-driven basis is provided for rail transit planning.
Drawings
FIG. 1 is a flow chart of the subway passenger analysis method based on data mining according to the present invention;
FIG. 2 is a block diagram of the subway passenger analysis system based on data mining according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings. According to an embodiment of the invention, a subway passenger analysis method based on data mining is provided, a flow chart of the method is shown in fig. 1, and the method comprises the following steps:
Source data of subway passenger travel transactions are acquired and given a simple cleaning, for example by removing abnormal values, null values, extreme values, and records that do not conform to the applicable rules. The sources of the travel transaction data include IC card records, QR code ride records, and the like.
Passenger travel feature data are extracted from the source data as clustering variables, for example by querying the database with SQL. The feature data have the following dimensions: the average time of the first station entry of the day, the average time of the last station exit of the day, the average number of travel days per month, the average number of trips per week, the average duration of a single trip, and the number of stations recorded by card swipes. In addition, quantities such as the most probable entry station and exit station of the first trip of the day and the most probable entry station and exit station of the last trip of the day can be calculated from the feature data. The form and representation of the passenger travel feature data are shown in Table 1:
Table 1. Form and example of passenger travel feature data
[Table image not reproduced in this text]
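The feature dimensions listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the embodiment's SQL: the record layout `(entry_time, exit_time, entry_station, exit_station)`, the function names, and the dictionary keys are hypothetical, the number of months and weeks covered by the data is passed in by the caller, and "number of stations recorded" is taken to mean the count of distinct stations.

```python
from collections import defaultdict
from datetime import datetime


def minutes_of_day(ts: datetime) -> int:
    """Clock time as minutes since midnight, e.g. 08:30 -> 510."""
    return ts.hour * 60 + ts.minute


def travel_features(trips, n_months, n_weeks):
    """Compute one passenger's features from a non-empty list of trips.

    trips: list of (entry_time, exit_time, entry_station, exit_station);
    this layout is an assumption made for the sketch.
    """
    # Group the entry/exit timestamps by calendar day of entry.
    by_day = defaultdict(list)
    for entry_ts, exit_ts, _entry_station, _exit_station in trips:
        by_day[entry_ts.date()].append((entry_ts, exit_ts))

    # First entry and last exit of each travel day, in minutes since midnight.
    first_entries = [minutes_of_day(min(e for e, _ in day)) for day in by_day.values()]
    last_exits = [minutes_of_day(max(x for _, x in day)) for day in by_day.values()]
    durations = [(x - e).total_seconds() / 60 for e, x, _, _ in trips]
    stations = {s for _, _, a, b in trips for s in (a, b)}

    n_days = len(by_day)
    return {
        "avg_first_entry_min": sum(first_entries) / n_days,
        "avg_last_exit_min": sum(last_exits) / n_days,
        "travel_days_per_month": n_days / n_months,
        "trips_per_week": len(trips) / n_weeks,
        "avg_trip_minutes": sum(durations) / len(trips),
        "stations_recorded": len(stations),  # distinct stations (assumption)
    }


# Made-up example records for one passenger over two days.
trips = [
    (datetime(2021, 3, 1, 8, 30), datetime(2021, 3, 1, 9, 0), "A", "B"),
    (datetime(2021, 3, 1, 18, 0), datetime(2021, 3, 1, 18, 30), "B", "A"),
    (datetime(2021, 3, 2, 8, 50), datetime(2021, 3, 2, 9, 20), "A", "B"),
    (datetime(2021, 3, 2, 18, 20), datetime(2021, 3, 2, 18, 50), "B", "A"),
]
features = travel_features(trips, n_months=1, n_weeks=1)
```

In practice the same aggregation would more likely be written as a grouped SQL query, as the description suggests; the Python form is used here only to make each dimension explicit.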
Standardized scaling and normalization are then applied to the clustering variables. Each clustering variable is converted to an integer or floating-point type; the average first-entry time and the average last-exit time are converted to minutes since midnight, for example 08:30 becomes 8×60+30=510. The raw feature ranges of the clustering variables are too large, which slows down model computation. In addition, the values of the average first-entry time and the average last-exit time are so large that these two variables would dominate the clustering result, which affects model accuracy and convergence speed. The clustering variables can therefore be normalized according to the following steps:
each data value is indexed by i, the index of the passenger it belongs to, and by j, the dimension it represents; each clustering variable undergoes a standardization conversion, the standardized variable being:
[Equation image: standardization formula]
where the formula uses the specific value of the j-th dimension of the i-th passenger and the mean of the j-th dimension of the current data;
calculating the entropy value of each dimension:
[Equation image: entropy formula]
calculating the weight of each dimension:
[Equation image: weight formula]
the normalized data being:
[Equation image: normalized data]
where the formulas use the data value of the j-th dimension of the i-th passenger, the mean of the j-th dimension of the current data, and the weight of the j-th dimension; W is the vector of the dimension weights, x is the vector of the data values, L is the data dimensionality, and n is the total number of data records, that is, the data of n passengers in total.
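As a rough illustration of the time conversion and of entropy-based weighting, the sketch below uses one common formulation of the entropy weight method. The formulas of the embodiment itself are not legible in this text, so everything here is an assumption: scaling each dimension by its mean, the entropy e_j = -(1/ln n) Σ p_ij ln p_ij with p_ij the share of passenger i in dimension j, the weight w_j proportional to 1 - e_j, and all function names. It also assumes strictly positive feature values and at least two passengers.

```python
import math


def to_minutes(hhmm: str) -> int:
    """Convert "HH:MM" to minutes since midnight, e.g. "08:30" -> 510."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def entropy_weights(rows):
    """Entropy weight per dimension (common textbook form, assumed here)."""
    n, dims = len(rows), len(rows[0])
    diversity = []
    for j in range(dims):
        column = [row[j] for row in rows]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total  # share of this passenger in dimension j
            if p > 0:
                entropy -= p * math.log(p)
        # Dividing by ln(n) scales the entropy into [0, 1].
        diversity.append(1.0 - entropy / math.log(n))
    spread = sum(diversity)
    if spread == 0:  # every dimension is constant: fall back to equal weights
        return [1.0 / dims] * dims
    return [d / spread for d in diversity]


def weighted_features(rows):
    """Scale each dimension by its mean, then apply the entropy weights."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    scaled = [[row[j] / means[j] for j in range(dims)] for row in rows]
    weights = entropy_weights(scaled)
    return [[w * v for w, v in zip(weights, row)] for row in scaled], weights


# Made-up rows: first-entry time, last-exit time, and a constant third feature.
rows = [
    [to_minutes("08:30"), to_minutes("18:00"), 3.0],
    [to_minutes("09:00"), to_minutes("18:30"), 3.0],
    [to_minutes("17:00"), to_minutes("23:00"), 3.0],
]
normalized, weights = weighted_features(rows)
```

Under this formulation a dimension whose values are identical for every passenger carries no information and receives a weight of essentially zero, while more dispersed dimensions receive larger weights.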
The passenger travel modes are clustered by a clustering method, for example by applying the k-means algorithm to the clustering variables, and the final number of clusters k is determined to obtain the classification result of the passenger travel modes.
The optimization goal of the k-means algorithm is to minimize the sum of the distances between all passenger feature data points and the cluster center points to which they belong, that is, the cost function J:
[Equation image: cost function J]
where J is the cost function, k is the number of clusters (the number of passenger travel mode types), the cluster center coordinates are the coordinate means of the feature data of the users in each cluster, m is the total number of sample points (the number of passengers taking part in the clustering), the sample points are the feature data of the individual passengers, and the remaining symbol denotes the coordinates of the center of the cluster to which a passenger's feature data point belongs. The specific clustering process is as follows:
K points are selected as cluster center points. The K cluster center points may be determined according to the following steps:
the number of clusters K and the sample data are substituted into a Gaussian mixture model:
[Equation image: Gaussian mixture model]
where the mixing coefficients are the proportions of the clusters and sum to 1, the mean vectors are the center coordinate points of the clusters, and each covariance matrix is an L×L matrix, L being the data dimensionality;
the parameters of the Gaussian mixture model are initialized and the posterior probability that each sample was generated by each mixture component is calculated:
[Equation image: posterior probability], i=1, 2, …, K; j=1, 2, …, L.
After iterating the above formula, the new mean vectors, covariance matrices, and mixing coefficients are calculated:
[Equation images: updated mean vector, covariance matrix, and mixing coefficient]
The parameters of the Gaussian mixture model are updated to these new values, thereby giving the initial cluster center coordinates.
Each data point $x_i$ is associated, according to its distance to each of the K cluster center points, with the nearest cluster center point, and all points associated with the same cluster center point are grouped into one cluster:
$$\min\{\, d(x_i, \mu_1),\ d(x_i, \mu_2),\ d(x_i, \mu_3),\ \dots,\ d(x_i, \mu_K) \,\}$$
where $d(x_i, \mu_1)$ denotes the Euclidean distance between the feature data point $x_i$ and the cluster center $\mu_1$;
the coordinate mean of each cluster is calculated and the cluster center point associated with that cluster is moved to the mean:
$$\mu_i = \frac{1}{m}\sum_{c=1}^{m} x_c^{(i)}$$
where $x_c^{(i)}$ is the c-th data point belonging to the i-th cluster and m is the number of data points belonging to the i-th cluster;
the above steps are repeated until the cluster center points no longer change.
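The assignment and update steps above can be sketched as a plain k-means loop. This is a minimal illustration, not the embodiment's implementation: the initial centers are simply passed in by the caller (the embodiment derives them from a Gaussian mixture model), and the function name and the `max_iter` safeguard are assumptions.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: assign to the nearest center, move centers to the mean."""
    centers = [list(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center
        # (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            distances = [math.dist(p, c) for c in centers]
            clusters[distances.index(min(distances))].append(p)

        # Update step: move each center to the coordinate mean of its cluster;
        # a center with no assigned points is left where it is.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(center)

        # Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Made-up two-dimensional example with two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = kmeans(points, initial_centers=[(0, 0), (10, 0)])
```

With exact floating-point comparison as the stopping test, the loop ends only when an iteration leaves every center unchanged; a tolerance-based test would be a reasonable alternative for real data.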
The choice of k affects the quality of the final clustering result. If k is too small, the final cost function J is large; if k is too large, the cost function J is very small but there are too many classes, so the practical effect is poor. In this embodiment the number of clusters k is selected using the deviation coefficient, so that a suitable number of clusters can be chosen quickly according to the characteristics of the data and a clustering result that better matches the distribution of passengers is obtained.
The number of clusters k is chosen as the k that minimizes the deviation coefficient, which is defined as:
[Equation image: definition of the deviation coefficient]
where the two quantities appearing in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data.
The travel trajectory of the passenger is analyzed by a spatio-temporal analysis method, which can be carried out according to the following steps:
The passenger's home station and workplace station are estimated from the passenger travel feature data extracted from the source data:
From the passenger travel feature data, the most probable entry station of the first trip of the day, the most probable exit station of the last trip of the day, and the most probable exit station of the first trip of the day are calculated. If the most probable entry station of the first trip of the day is the same as the most probable exit station of the last trip of the day, that station is the home station. The most probable exit station of the first trip of the day is the workplace station.
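A minimal sketch of this estimation is given below, assuming that "maximum probability" means the most frequent station over the observed days and that the workplace station is the most frequent exit station of the first trip of the day. The record layout `(day, entry_minute, entry_station, exit_station)` and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def infer_home_and_work(trips):
    """Estimate (home_station, work_station) for one passenger.

    trips: list of (day, entry_minute, entry_station, exit_station);
    this layout is an assumption made for the sketch.
    """
    by_day = defaultdict(list)
    for day, entry_minute, entry_station, exit_station in trips:
        by_day[day].append((entry_minute, entry_station, exit_station))
    if not by_day:
        return None, None

    first_entry, first_exit, last_exit = Counter(), Counter(), Counter()
    for day_trips in by_day.values():
        day_trips.sort()  # order the day's trips by entry time
        first_entry[day_trips[0][1]] += 1
        first_exit[day_trips[0][2]] += 1
        last_exit[day_trips[-1][2]] += 1

    top_first_entry = first_entry.most_common(1)[0][0]
    top_first_exit = first_exit.most_common(1)[0][0]
    top_last_exit = last_exit.most_common(1)[0][0]

    # Home: the usual first entry of the day matches the usual last exit.
    home = top_first_entry if top_first_entry == top_last_exit else None
    # Work: the usual exit station of the first trip (assumed reading).
    work = top_first_exit
    return home, work


# Made-up records: two commuting days and one atypical day.
trips = [
    ("d1", 510, "A", "B"), ("d1", 1080, "B", "A"),
    ("d2", 515, "A", "B"), ("d2", 1085, "B", "A"),
    ("d3", 600, "C", "D"),
]
home, work = infer_home_and_work(trips)
```

Returning `None` for the home station when the two most frequent stations disagree keeps the sketch from guessing; how such passengers should be handled is not specified in the text.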
The distance between the home address and the workplace is calculated from the home station and the workplace station:
Let the latitude of a station be α and its longitude be β. A coordinate system is established with the center of the earth as the origin, the line from the center of the earth to the point on the equator at longitude 0° as the X axis, the line from the center of the earth to the point on the equator at longitude 90° east as the Y axis, and the line from the center of the earth to the north pole as the Z axis, giving the coordinates (x, y, z) of the station:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
where R is the radius of the earth, north latitude and east longitude are taken as positive, and south latitude and west longitude are taken as negative;
the straight-line distance L between the home station and the workplace station is calculated:
$$L = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
where the coordinates of the home station are $(x_1, y_1, z_1)$ and the coordinates of the workplace station are $(x_2, y_2, z_2)$;
the straight-line distance L is converted into an arc length C, thereby obtaining the distance between the home address and the workplace:
C=arcsin(L/(2R))×π×R/90, with arcsin expressed in degrees.
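The coordinate conversion, the straight-line distance L, and the arc length C can be checked with a short sketch. The earth radius of 6371 km and the function names are assumptions (the text only refers to R); latitudes and longitudes are in degrees, with north and east positive.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius; the text only names it R


def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Station coordinates (x, y, z) in the earth-centered system of the text."""
    alpha, beta = math.radians(lat_deg), math.radians(lon_deg)
    return (
        radius * math.cos(alpha) * math.cos(beta),
        radius * math.cos(alpha) * math.sin(beta),
        radius * math.sin(alpha),
    )


def station_distance(home, work, radius=EARTH_RADIUS_KM):
    """Arc length C between two (lat, lon) stations via the chord length L."""
    chord = math.dist(to_cartesian(*home, radius), to_cartesian(*work, radius))
    # arcsin(L / 2R) in degrees; the ratio is clamped to guard against rounding.
    half_angle_deg = math.degrees(math.asin(min(1.0, chord / (2 * radius))))
    return half_angle_deg * math.pi * radius / 90


# Two points on the equator, 90 degrees of longitude apart.
quarter_circle = station_distance((0.0, 0.0), (0.0, 90.0))
```

For the two equatorial points the result is a quarter of the earth's circumference, πR/2, which is a convenient sanity check on the chord-to-arc conversion.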
The results of classifying the passenger travel modes according to this embodiment are shown in Table 2:
Table 2. Passenger travel mode classification results
[Table image not reproduced in this text]
Mode 1: late departure, late return type
Mode 2: regular type (features differ little from the mean)
Mode 3: outgoing type (average travel days and trip counts are above the mean)
Mode 4: early departure, late return type (few station records, longer duration per trip)
According to the scheme of the embodiment, the personal preference of the passenger can be roughly judged.
From the perspective of information theory, the lower the frequency with which an event occurs, the greater the amount of information it carries:
$$f(P)=-\log P$$
where f(P) is the amount of information corresponding to an event and P is the probability of that event. The lower the probability, the larger the information content.
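As a small numerical illustration, the sketch below assumes the standard self-information measure, the negative base-2 logarithm of the probability. The function name and the choice of base are assumptions; the text only states that lower probability means more information.

```python
import math


def self_information(p):
    """Information content, in bits, of an event with probability p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# A station visited on 1% of trips carries more information than one visited on 50%.
rare = self_information(0.01)
frequent = self_information(0.5)
print(rare > frequent)
```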
The personal preferences of a passenger can be inferred from the events that correspond to the low-frequency records in the passenger's travel history. Using a method based on spatio-temporal analysis, suppose for example that a concert is held at time A, that passenger X has an exit record at station P near the venue during that period, and that the probability of the passenger appearing at station P is lower than the passenger's average probability over stations; it can then be inferred that the passenger attended the concert and likes music.
Event → time, space, person → personal preference
The process of determining that a passenger took part in an activity and labeling the corresponding hobby is as follows:
1. Spatial matching:
(a) using a web crawler, crawl the time, the place and the preference type of the various activities published on the Internet; the hobbies can be preliminarily divided into music, drama, sports, parent-child activities, exhibitions, vocals and the like;
(b) crawl the longitude and latitude of the subway and bus stations and of each activity venue from the Internet; calculate the straight-line distances from the longitude and latitude, and record the four subway stations nearest to each activity venue.
2. Person and time matching:
(a) before the activity begins, the passenger has an exit transaction record at one of the four stations nearest to the activity venue;
(b) after the activity ends, the passenger has an entry transaction record at one of the four stations nearest to the activity venue;
(c) the passenger has no travel transaction record while the activity is in progress.
3. Final screening of the passengers who satisfy the space, time and person conditions:
Determine whether the stations near the activity venue are common stations for the passenger: query the frequency of each station in the passenger's travel records, and mark a station as a common station if its frequency is greater than the average frequency, otherwise as an uncommon station. Passengers for whom a nearby station is a common station are excluded, and the remaining passengers are labeled with the personal preference.
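The three screening steps can be sketched as follows. The `Trip` and `Activity` records, the use of hours as the time unit and all function names are hypothetical; the sketch also simplifies step 2 by not limiting the arrival and departure to a window around the activity, which a real implementation would probably need.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trip:
    entry_station: str
    entry_time: float  # hours since an arbitrary reference instant (assumed unit)
    exit_station: str
    exit_time: float


@dataclass(frozen=True)
class Activity:
    preference: str             # preference type, e.g. "music"
    start: float
    end: float
    nearby_stations: frozenset  # e.g. the four stations nearest to the venue


def attended(trips, activity):
    """Step 2: exit nearby before the start, enter nearby after the end, no record in between."""
    near = activity.nearby_stations
    arrived = any(t.exit_station in near and t.exit_time <= activity.start for t in trips)
    left = any(t.entry_station in near and t.entry_time >= activity.end for t in trips)
    travelled_during = any(
        activity.start < t.entry_time < activity.end
        or activity.start < t.exit_time < activity.end
        for t in trips
    )
    return arrived and left and not travelled_during


def common_stations(trips):
    """Step 3: stations whose frequency exceeds the passenger's average station frequency."""
    counts = Counter()
    for t in trips:
        counts[t.entry_station] += 1
        counts[t.exit_station] += 1
    if not counts:
        return set()
    average = sum(counts.values()) / len(counts)
    return {station for station, c in counts.items() if c > average}


def label_preference(trips, activity):
    """Return the preference label, or None if the passenger is screened out."""
    if not attended(trips, activity):
        return None
    if activity.nearby_stations & common_stations(trips):
        return None  # the venue station is a common station for this passenger
    return activity.preference


# Five commuting days between A and B, then one evening trip to station P and back.
history = [Trip("A", 24 * d + 8, "B", 24 * d + 8.5) for d in range(5)]
history += [Trip("B", 24 * d + 18, "A", 24 * d + 18.5) for d in range(5)]
history += [Trip("A", 138.0, "P", 138.5), Trip("P", 142.3, "A", 143.0)]
concert = Activity("music", start=139.0, end=142.0, nearby_stations=frozenset({"P"}))
label = label_preference(history, concert)
print(label)
```

The spatial step (finding the nearest stations to a venue) is taken as given here through `nearby_stations`.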
According to another embodiment of the invention, a subway passenger analysis system based on data mining is provided, and the system is configured as shown in fig. 2 and comprises a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel track analysis module.
The data acquisition module is used for acquiring source data of subway passenger travel transactions; extracting passenger travel characteristic data from the source data as a clustering variable;
the data processing module is used for carrying out standardized scaling and normalization processing on the extracted passenger trip characteristic data;
the passenger travel mode classification module classifies the passenger travel modes by a clustering method and determines the final cluster number k, wherein the cluster number k is the k that minimizes the deviation coefficient, and the deviation coefficient is:
[formula image: deviation coefficient]
wherein the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
and the passenger travel track analysis module analyzes the travel track of the passenger by using a space-time analysis method.
In summary, the invention relates to a subway passenger analysis method and system based on data mining. Source data are obtained from passenger travel transaction records and are processed and analyzed in multiple dimensions, so that subway passengers are classified more accurately and effectively and their travel tracks are obtained, which provides a reliable data basis for rail transit planning while guaranteeing the operational safety of large network passenger flows. The technical scheme of the invention has the following beneficial technical effects:
(1) The final cluster number in the clustering process is selected by calculating the deviation coefficient. The choice of the cluster number strongly affects the quality of the final clustering result: too few clusters make the final cost function large, whereas too many clusters make the cost function very small but produce too many classes and a poor practical effect. Selecting the cluster number through the deviation coefficient allows a suitable number to be chosen quickly according to the characteristics of the data, giving a clustering result that better matches the distribution of passengers.
(2) The cluster variables are subjected to standardized scaling and normalization before data analysis, which improves the degree of standardization of the data to be processed and prevents large differences between the feature ranges of the cluster variables from impairing the calculation precision and the convergence speed.
(3) Obtaining the initial cluster center points from a Gaussian model quickly yields suitable initial cluster centers, and suitable initial cluster centers speed up the fitting of the algorithm and make the clustering result more effective and reasonable.
(4) The travel track of the passenger is analyzed by a spatio-temporal analysis method: the address and the work place of the passenger and the approximate distance between them are calculated, the passenger's travel on the subway is further analyzed, and a reliable data basis is provided for rail transit planning.
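The clustering iteration that effects (1) to (3) refer to can be illustrated with a plain k-means sketch in Python. This is only a sketch under stated assumptions: the initial centers are passed in by the caller (in the scheme described they would come from a Gaussian mixture model, which is not implemented here), the deviation-coefficient selection of k is not shown, and the function name and sample points are hypothetical.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Assign each point to its nearest center, move centers to cluster means, repeat."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest Euclidean distance to p.
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer change: converged
            break
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

An empty cluster keeps its previous center in this sketch, which is one simple way to avoid a division by zero.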
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (9)

1. A subway passenger analysis method based on data mining is characterized by comprising the following steps:
acquiring source data of subway passenger travel transactions;
extracting passenger travel characteristic data from the source data as a clustering variable;
carrying out standardized scaling and normalization processing on the clustering variables;
clustering the passenger travel modes by a clustering method, and determining the final cluster number k to obtain a classification result of the passenger travel modes; the cluster number k is the k that minimizes the deviation coefficient, the deviation coefficient being:
[formula image: deviation coefficient]
wherein the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
and analyzing the travel track of the passenger by using a space-time analysis method.
2. The method of claim 1, wherein the passenger travel feature data extracted from the source data comprises: the average time of the first entry, the average time of the last exit, the monthly average number of travel days, the weekly average number of trips, the average duration of a single trip, and the number of stations recorded by card swiping.
3. The method of claim 2, wherein normalizing the cluster variables comprises the steps of:
for each cluster variable, performing a standardized transformation, the standardized variable being:
[formula image: standardized variable]
calculating the entropy value of each dimension:
[formula image: entropy value]
calculating the weight of each dimension:
[formula image: weight of each dimension]
the normalized data being:
[formula image: normalized data]
wherein the symbols in the formulas denote, respectively, the data value of the j-th dimension of the i-th passenger, the mean of the j-th dimension of the current data, and the weight of the j-th dimension; W is the vector of the weights of the dimensions, x is the vector of the data values, L is the data dimension and n is the total number of data.
4. The method according to claim 3, wherein the clustering method for the passenger travel patterns comprises the steps of:
selecting K points as cluster center points;
for each datum, according to its distances to the K cluster center points, associating the datum with the nearest cluster center point, all the data associated with the same cluster center point forming one class:
$$\min\{d(i,\mu_1),\ d(i,\mu_2),\ d(i,\mu_3),\ \dots,\ d(i,\mu_K)\}$$
wherein $d(i,\mu_k)$ denotes the Euclidean distance between the feature data $i$ and the class center $\mu_k$;
calculating the coordinate mean of each cluster, and moving the cluster center point associated with that cluster to the position of the mean:
$$\mu_i=\frac{1}{m}\sum_{c=1}^{m}x_c^{(i)}$$
wherein $x_c^{(i)}$ is the c-th datum belonging to the i-th cluster, and m is the number of data belonging to the i-th cluster;
repeating the above steps until the class center points no longer change.
5. The method of claim 4, wherein selecting the K points as cluster center points comprises:
substituting the cluster number and the sample data into a Gaussian mixture model;
obtaining the coordinates of the initial cluster center points after iteration.
6. The method according to claim 5, wherein the travel trajectory of the passenger is analyzed by using a spatiotemporal analysis method, comprising the following steps:
calculating the address station and the work place station of the passenger according to the passenger travel characteristic data extracted from the source data;
and calculating the distance between the address and the working place according to the address station and the working place station.
7. The method of claim 6, wherein estimating the address and the work place of the passenger based on the passenger travel feature data extracted from the source data comprises:
calculating, according to the passenger travel characteristic data, the entry station with the maximum probability on the first trip of each day, the exit station with the maximum probability on the last trip of each day, and the exit station with the maximum probability on the first trip of each day;
if the most probable entry station of the first trip of each day is the same as the most probable exit station of the last trip of each day, that station is the address station;
the most probable exit station of the first trip of each day is the work place station.
8. The method of claim 7, wherein estimating the distance between the address and the work site based on the address site and the work site comprises:
letting the latitude of a station be α and its longitude be β, and establishing a coordinate system with the center of the earth as the origin, the line from the center of the earth to the point of 0° longitude on the equator as the X axis, the line from the center of the earth to the point of 90° east longitude on the equator as the Y axis, and the line from the center of the earth to the North Pole as the Z axis, to obtain the coordinates (x, y, z) of the station:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
wherein R is the radius of the earth; north latitude and east longitude are taken as positive, and south latitude and west longitude as negative;
calculating the straight-line distance L between the address station and the work place station:
$$L=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$$
wherein the coordinates of the address station are $(x_1, y_1, z_1)$ and the coordinates of the work place station are $(x_2, y_2, z_2)$;
converting the straight-line distance L into an arc length C, thereby obtaining the distance between the address and the work place:
C = arcsin(L/(2R)) × π × R / 90,
wherein arcsin is expressed in degrees.
9. A subway passenger analysis system based on data mining, characterized by comprising a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel track analysis module; wherein,
the data acquisition module acquires source data of trip transactions of subway passengers; extracting passenger travel characteristic data from the source data as a clustering variable;
the data processing module is used for carrying out standardized scaling and normalization processing on the extracted passenger trip characteristic data;
the passenger travel mode classification module classifies the passenger travel modes by a clustering method and determines the final cluster number k, wherein the cluster number k is the k that minimizes the deviation coefficient, and the deviation coefficient is:
[formula image: deviation coefficient]
wherein the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
the passenger travel track analysis module analyzes the travel track of the passenger by using a time-space analysis method.
CN202110562020.8A 2021-05-24 2021-05-24 Subway passenger analysis method and system based on data mining Pending CN112988855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110562020.8A CN112988855A (en) 2021-05-24 2021-05-24 Subway passenger analysis method and system based on data mining

Publications (1)

Publication Number Publication Date
CN112988855A true CN112988855A (en) 2021-06-18

Family

ID=76337116

Country Status (1)

Country Link
CN (1) CN112988855A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210618)