CN112988855A - Subway passenger analysis method and system based on data mining - Google Patents
- Publication number
- CN112988855A CN202110562020.8A
- Authority
- CN
- China
- Prior art keywords
- passenger
- data
- travel
- clustering
- station
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Abstract
The invention relates to a subway passenger analysis method and system based on data mining. Source data are obtained from passenger travel transaction records and then processed and analyzed along multiple dimensions, so that subway passenger travel is classified more accurately and effectively, the travel trajectories of passengers are obtained, and a reliable data basis is provided for rail transit planning while ensuring operational safety under large network passenger flows. In this technical scheme, the final number of clusters is selected by calculating a deviation coefficient, so that a suitable number of clusters can be chosen quickly according to the characteristics of the data and the clustering result better matches the distribution of passengers. The travel trajectory of each passenger is analyzed by a spatio-temporal analysis method: the passenger's home address and workplace are estimated and the approximate distance between them is calculated, so that the travel behavior of subway passengers is analyzed in depth and a reliable data basis is provided for rail transit planning.
Description
Technical Field
The invention relates to the technical field of information data processing, in particular to a subway passenger analysis method and system based on data mining.
Background
The operation of a very large subway network faces heavy traffic pressure under both normal and emergency conditions, and the spatio-temporal travel trajectory of each passenger in the network reflects that passenger's travel choices and activity characteristics in a given time period. Real-time passenger travel trajectories can provide a detailed data basis for estimating the real-time load factor of trains, monitoring the real-time distribution of network passenger flow, optimizing passenger transport organization schemes, formulating flexible fare strategies, and so on. In addition, obtaining, at the network level, the proportion of passengers transferring between different paths and the path-choice behavior of different types of passengers allows the spatio-temporal correlation of passenger flow distribution to be mined more accurately, provides a quantitative basis for formulating station passenger transport organization schemes and for proactive early warning of line passenger flow, and raises the level of intelligence with which the risks of networked subway operation are actively managed and controlled.
With the development of urban rail transit construction in China and the rapid advance of urbanization, meeting the growing travel demands of residents through sound rail transit design has become an urgent problem. Traditional rail travel behavior analysis models and methods, which directly observe pedestrian flow and station throughput, struggle to meet the demand for greater accuracy and finer granularity. Meanwhile, the rail travel patterns of residents reflect changes in urban social space well and provide a valuable reference for sound urban planning.
Disclosure of Invention
In view of the above state of the prior art, the invention aims to provide a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, obtain the travel trajectories of passengers, and provide a reliable data basis for rail transit planning while ensuring operational safety under large network passenger flows.
In order to achieve the above object, according to one aspect of the present invention, there is provided a subway passenger analysis method based on data mining, comprising the steps of:
acquiring source data of subway passenger travel transactions;
extracting passenger travel characteristic data from the source data as a clustering variable;
carrying out standardized scaling and normalization processing on the clustering variables;
clustering the passenger travel modes by a clustering method, and determining the final number of clusters k to obtain a classification result of the passenger travel modes; the number of clusters k is chosen as the k that minimizes the deviation coefficient, the deviation coefficient being given by: [formula not reproduced]
where the two symbols in the formula denote, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data;
and analyzing the travel track of the passenger by using a space-time analysis method.
Further, the passenger travel feature data extracted from the source data include: the average time of the first station entry of the day, the average time of the last station exit of the day, the monthly average number of travel days, the weekly average number of trips, the average duration of a single trip, and the number of stations recorded by card swipes.
Further, the normalization processing is performed on the clustering variables, and the method comprises the following steps:
for each clustering variable, performing a standardization conversion to obtain the standardized variable: [formula not reproduced]
calculating the weight of each dimension: [formula not reproduced]
the normalized data being: [formula not reproduced]
where the symbols in the formulas denote, in order: the data value of the j-th dimension of the i-th passenger; the mean of the j-th dimension of the current data; the weight of the j-th dimension; W, the vector of the weights of the dimensions; and x, the vector of the corresponding data values; L is the data dimensionality and N is the total number of data records.
Further, the clustering method for clustering the passenger travel modes comprises the following steps:
selecting K points as clustering center points;
for each data point, according to its distances to the K cluster center points, associating it with the nearest cluster center point, all points associated with the same cluster center point forming one cluster: [formula not reproduced]
calculating the coordinate mean of each cluster, and moving the cluster center point associated with that cluster to the position of the mean: $\mu_i = \frac{1}{m}\sum_{c=1}^{m} x_c^{(i)}$
where $\mu_i$ is the center point of the i-th cluster, $x_c^{(i)}$ is the c-th data point belonging to the i-th cluster, and m is the number of data points belonging to the i-th cluster;
Further, selecting the K points as cluster center points comprises:
substituting the number of clusters and the sample data into a Gaussian mixture model.
Further, the method for analyzing the travel trajectory of the passenger by using the space-time analysis method comprises the following steps:
calculating the address station and the work place station of the passenger according to the passenger travel characteristic data extracted from the source data;
and calculating the distance between the address and the working place according to the address station and the working place station.
Further, the estimating the address and the working place of the passenger according to the passenger travel feature data extracted from the source data includes:
according to the passenger travel characteristic data, determining the most probable entry station of the first trip of each day, the most probable exit station of the first trip of each day, the most probable entry station of the last trip of each day, and the most probable exit station of the last trip of each day;
if the most probable entry station of the first trip of each day is the same as the most probable exit station of the last trip of each day, that station is the address station;
if the most probable exit station of the first trip of each day is the same as the most probable entry station of the last trip of each day, that station is the workplace station.
Further, the calculating the distance between the address and the work place according to the address site and the work place site includes:
setting the latitude of the station as alpha, the longitude as beta, taking the center of the earth as an origin, taking the connecting line of the center of the earth and a 0-longitude point on the equator as an X axis, taking the connecting line of the center of the earth and a 90-DEG east longitude point on the equator as a Y axis, and taking the connecting line of the center of the earth and a north pole as a Z axis, establishing a coordinate system, and obtaining the coordinates (X, Y, Z) of the station:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
where R is the radius of the earth, north latitude and east longitude are taken as positive, and south latitude and west longitude are taken as negative;
calculating the straight-line distance L between the address station (x1, y1, z1) and the workplace station (x2, y2, z2):
L=√((x1−x2)²+(y1−y2)²+(z1−z2)²)
converting the straight-line distance L into an arc length C, thereby obtaining the distance between the address and the workplace, with the arcsine expressed in degrees:
C=arcsin(L/(2R))×π×R/90.
According to another aspect of the invention, a subway passenger analysis system based on data mining is provided, comprising a data acquisition module, a data processing module, a passenger travel mode classification module, and a passenger travel trajectory analysis module, wherein:
the data acquisition module acquires source data of trip transactions of subway passengers; extracting passenger travel characteristic data from the source data as a clustering variable;
the data processing module is used for carrying out standardized scaling and normalization processing on the extracted passenger trip characteristic data;
the passenger travel mode classification module classifies passenger travel modes by a clustering method and determines the final number of clusters k, the number of clusters k being the k that minimizes the deviation coefficient, the deviation coefficient being given by: [formula not reproduced]
where the two symbols in the formula denote, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data;
the passenger travel track analysis module analyzes the travel track of the passenger by using a time-space analysis method.
In summary, the invention provides a subway passenger analysis method and system based on data mining, which obtain source data from passenger travel transaction records, process the source data and analyze them along multiple dimensions, classify subway passenger travel more accurately and effectively, obtain the travel trajectories of passengers, and provide a reliable data basis for rail transit planning while ensuring operational safety under large network passenger flows. The technical scheme of the invention has the following beneficial technical effects:
(1) The final number of clusters is selected by calculating a deviation coefficient. The choice of the number of clusters strongly affects the quality of the final clustering result: too few clusters make the final cost function large, while too many clusters make the cost function very small but produce so many classes that the practical effect is poor. Selecting the number of clusters through the deviation coefficient allows a suitable number to be chosen quickly according to the characteristics of the data, yielding a clustering result that better matches the distribution of passengers.
(2) The clustering variables are standardized, scaled, and normalized before data analysis, which makes the data to be processed more uniform and prevents large differences between the value ranges of the clustering variables from degrading calculation accuracy and convergence speed.
(3) Obtaining the initial cluster center points from a Gaussian mixture model quickly yields suitable initial cluster centers, and suitable initial center points speed up the fitting of the algorithm and make the clustering result more effective and reasonable.
(4) The travel trajectory of each passenger is analyzed by a spatio-temporal analysis method, the passenger's address and workplace and the approximate distance between them are calculated, the travel behavior of subway passengers is analyzed further, and a reliable data basis is provided for rail transit planning.
Drawings
FIG. 1 is a flow chart of a subway passenger analysis method based on data mining according to the present invention;
fig. 2 is a block diagram of the subway passenger analysis system based on data mining according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings. According to an embodiment of the invention, a subway passenger analysis method based on data mining is provided, a flow chart of the method is shown in fig. 1, and the method comprises the following steps:
Source data of subway passenger travel transactions are acquired and given a simple cleaning, such as removing abnormal values, null values, extreme values, and records that violate the applicable rules. Sources of the travel transaction data include IC card records, QR-code ride records, and the like.
Passenger travel characteristic data are extracted from the source data as clustering variables; they can be obtained, for example, by queries written in the database language SQL. The dimensions of the passenger travel characteristic data include: the average time of the first station entry of the day, the average time of the last station exit of the day, the monthly average number of travel days, the weekly average number of trips, the average duration of a single trip, and the number of stations recorded by card swipes. Further, data such as the most probable entry station of the first trip, the most probable exit station of the first trip, the most probable entry station of the last trip, and the most probable exit station of the last trip can be calculated from the characteristic data. The form and representation of the passenger travel characteristic data are shown in Table 1:
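The cleaning step can be illustrated with a minimal sketch. The record fields (`card_id`, `entry_station`, `exit_station`, `entry_time`, `exit_time`), the timestamp layout, and the `MAX_TRIP_MINUTES` threshold are all assumptions made for illustration; the text does not specify the actual record schema or the rules applied.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"   # assumed timestamp layout
MAX_TRIP_MINUTES = 240           # hypothetical upper bound for a plausible trip
REQUIRED = ("card_id", "entry_station", "exit_station", "entry_time", "exit_time")

def clean_records(records):
    """Keep only records that have every field, parse correctly, and have a plausible duration."""
    cleaned = []
    for rec in records:
        # Null values: drop records with a missing or empty required field.
        if any(rec.get(key) in (None, "") for key in REQUIRED):
            continue
        # Abnormal values: drop records whose timestamps cannot be parsed.
        try:
            t_in = datetime.strptime(rec["entry_time"], TIME_FORMAT)
            t_out = datetime.strptime(rec["exit_time"], TIME_FORMAT)
        except (ValueError, TypeError):
            continue
        # Extreme values and rule violations: exit before entry, or an overly long trip.
        minutes = (t_out - t_in).total_seconds() / 60
        if minutes <= 0 or minutes > MAX_TRIP_MINUTES:
            continue
        cleaned.append(rec)
    return cleaned
```

Each filter corresponds to one of the categories named in the text; a real deployment would replace the threshold and rules with those of the operating network.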
table 1 passenger travel characteristic data form and example
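As a rough illustration of how some of these feature dimensions could be derived from cleaned trip records, the sketch below computes a subset of them for one passenger. The record fields and the timestamp layout are assumptions, and the monthly and weekly averages are omitted because they depend on an observation window that the text does not specify.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp layout

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight, as in the text's 08:30 example."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def passenger_features(trips):
    """Compute a subset of the travel features for one passenger's trips."""
    by_day = defaultdict(list)
    stations = set()
    durations = []
    for trip in trips:
        t_in = datetime.strptime(trip["entry_time"], TIME_FORMAT)
        t_out = datetime.strptime(trip["exit_time"], TIME_FORMAT)
        by_day[t_in.date()].append((t_in, t_out))
        stations.update((trip["entry_station"], trip["exit_station"]))
        durations.append((t_out - t_in).total_seconds() / 60)

    def minute_of_day(t):
        return t.hour * 60 + t.minute

    # Earliest entry and latest exit of each travel day, in minutes since midnight.
    first_entries = [minute_of_day(min(t_in for t_in, _ in day)) for day in by_day.values()]
    last_exits = [minute_of_day(max(t_out for _, t_out in day)) for day in by_day.values()]
    return {
        "first_entry_avg_min": sum(first_entries) / len(first_entries),
        "last_exit_avg_min": sum(last_exits) / len(last_exits),
        "travel_days": len(by_day),
        "avg_trip_min": sum(durations) / len(durations),
        "station_count": len(stations),
    }
```

The dictionary keys are hypothetical names chosen for this sketch; in practice the same aggregates could be produced by SQL queries, as the text suggests.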
The clustering variables are then standardized, scaled, and normalized. Each clustering variable is converted to an integer or floating-point type; the average time of the first station entry and the average time of the last station exit are converted to minutes, for example 08:30 becomes 8 × 60 + 30 = 510. The value ranges of the original clustering variables are too large, which slows down model computation. In addition, the values of the first-entry and last-exit average times are so large that these two variables would dominate the clustering result, affecting model accuracy and convergence speed. Therefore, the clustering variables can be normalized according to the following steps:
for each clustering variable, where i is the index of the clustering variable in the data (that is, it identifies the data of the i-th passenger) and j is the dimension that the clustering variable represents, a standardization conversion is performed to obtain the standardized variable: [formula not reproduced]
where the symbols denote the value of the j-th dimension of the i-th passenger and the mean of the j-th dimension of the current data;
calculating the weight of each dimension: [formula not reproduced]
the normalized data being: [formula not reproduced]
where the symbols in the formulas denote, in order: the data value of the j-th dimension of the i-th passenger; the mean of the j-th dimension of the current data; the weight of the j-th dimension; W, the vector of the weights of the dimensions; and x, the vector of the corresponding data values; L is the data dimensionality and n is the total number of data records, that is, the data of n passengers in total.
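The patent's own standardization and weighting formulas are not reproduced in this text, so the sketch below swaps in ordinary z-score standardization purely to illustrate why scaling matters: after it, a column measured in minutes (hundreds) and a column measured in days (single digits) occupy comparable ranges. The weighting step is not modeled here.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score each column: subtract the column mean, divide by the column standard deviation.

    This is a common substitute technique, not the formula used in the patent.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

For example, rows such as `[510, 2]`, `[530, 4]`, `[550, 6]` (first-entry minute, travel days) become identical in both columns after standardization, so neither dimension dominates a distance computation.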
And clustering the passenger travel modes by adopting a clustering method, for example, classifying clustering variables by adopting a k-means algorithm, and determining the final clustering number k to obtain a classification result of the passenger travel modes.
The optimization goal of the k-means algorithm is to minimize the sum of the distances between all passenger characteristic data and the cluster center points to which they belong, that is, the cost function J: [formula not reproduced]
where J is the cost function; k is the number of clusters (that is, the number of passenger travel mode types); the cluster center coordinate is the mean of the coordinates of the characteristic data of the passengers in each cluster; m is the total number of sample points (that is, the number of passengers participating in clustering); each sample point is the characteristic data of one passenger; and the last symbol denotes the coordinates of the cluster center to which a passenger's characteristic-data point belongs. The specific clustering process is as follows:
and selecting K points as cluster center points. The determination of the K cluster center points may be performed according to the following steps:
substituting the number of clusters K and the sample data into the Gaussian mixture model $p(x) = \sum_{i=1}^{K} \alpha_i \, p(x \mid \mu_i, \Sigma_i)$, where $\alpha_i$ is the proportion of the i-th cluster, the proportions of all clusters summing to 1; $\mu_i$ is the mean vector, that is, the center coordinate point of each cluster; and $\Sigma_i$ is the covariance matrix, an L × L matrix, L being the data dimensionality;
initializing the parameters of the Gaussian mixture model and calculating the posterior probability that each data point was generated by each mixture component, for i = 1, 2, ..., K and j = 1, 2, ..., L; after iterating, calculating the new mean vector, covariance matrix, and mixing coefficient of each component: [formulas not reproduced]
for each data point, according to its distances to the K cluster center points, associating it with the nearest cluster center point, all points associated with the same cluster center point forming one cluster: [formula not reproduced]
calculating the coordinate mean of each cluster, and moving the cluster center point associated with that cluster to the position of the mean: $\mu_i = \frac{1}{m}\sum_{c=1}^{m} x_c^{(i)}$
where $\mu_i$ is the center point of the i-th cluster, $x_c^{(i)}$ is the c-th data point belonging to the i-th cluster, and m is the number of data points belonging to the i-th cluster;
and repeating the steps until the clustering center point is not changed.
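The assign-and-update loop described above can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the initial centers are simply passed in by the caller (the embodiment derives them from a Gaussian mixture model, which is not reproduced here), Euclidean distance is assumed, and `max_iter` is a hypothetical safety cap.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means: assign each point to its nearest center, then move centers to cluster means."""
    centers = [list(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: associate each point with the nearest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the coordinate mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With points `(0, 0)`, `(0, 2)`, `(10, 10)`, `(10, 12)` and initial centers `(0, 0)` and `(10, 10)`, the centers settle at `[0.0, 1.0]` and `[10.0, 11.0]` after the second pass.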
The k value selection influences the quality of the final clustering result. If k is too small, the final cost function J is larger, and if k is too large, although the cost function J is very small, the number of classification is too many, so that the actual effect is not good. In the embodiment, the clustering number k is selected by using the deviation coefficient, so that the proper clustering number can be quickly selected according to the data characteristics, and the clustering result more conforming to the passenger distribution rule is obtained.
The number of clusters k is chosen as the k that minimizes the deviation coefficient, the deviation coefficient being given by: [formula not reproduced]
where the two symbols in the formula denote, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the dimensionality of the data.
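Because the deviation coefficient formula itself is not reproduced in this text, the following sketch assumes a simple stand-in: the ratio of standard deviation to absolute mean for each cluster and dimension, averaged over all clusters and dimensions. This assumed form and the helper names are hypothetical; only the selection rule (pick the k with the smallest coefficient) comes from the text.

```python
from statistics import mean, pstdev

def deviation_coefficient(clusters):
    """Assumed form: average of sigma_ij / |mu_ij| over clusters i and dimensions j."""
    ratios = []
    for cluster in clusters:
        for column in zip(*cluster):
            mu = mean(column)
            sigma = pstdev(column)
            if mu != 0:  # skip dimensions where the ratio is undefined
                ratios.append(sigma / abs(mu))
    return mean(ratios) if ratios else 0.0

def choose_k(clusterings):
    """Given a mapping k -> list of clusters, return the k with the smallest coefficient."""
    return min(clusterings, key=lambda k: deviation_coefficient(clusterings[k]))
```

Here `clusterings` would be built by running the clustering once per candidate k; the selection then reduces to a single `min` over the candidates.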
The travel track of the passenger is analyzed by a time-space analysis method, and the method can be carried out according to the following steps:
according to passenger travel characteristic data extracted from the source data, estimating the address station and the work place station of the passenger:
according to the passenger travel characteristic data, the most probable entry station of the first trip of each day, the most probable exit station of the first trip of each day, the most probable entry station of the last trip of each day, and the most probable exit station of the last trip of each day are determined; if the most probable entry station of the first trip of each day is the same as the most probable exit station of the last trip of each day, that station is the address station; if the most probable exit station of the first trip of each day is the same as the most probable entry station of the last trip of each day, that station is the workplace station.
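A minimal sketch of this estimation follows. It assumes that "most probable" means "most frequent across days", and that the workplace check compares the exit station of the first trip with the entry station of the last trip of the day; both readings, and the record fields used, are assumptions.

```python
from collections import Counter, defaultdict

def most_common(values):
    """Return the most frequent value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]

def infer_home_and_work(trips):
    """trips: dicts with 'date', 'entry_time' (minutes since midnight), 'entry_station', 'exit_station'."""
    by_day = defaultdict(list)
    for trip in trips:
        by_day[trip["date"]].append(trip)

    first_in, first_out, last_in, last_out = [], [], [], []
    for day_trips in by_day.values():
        ordered = sorted(day_trips, key=lambda t: t["entry_time"])
        first, last = ordered[0], ordered[-1]
        first_in.append(first["entry_station"])
        first_out.append(first["exit_station"])
        last_in.append(last["entry_station"])
        last_out.append(last["exit_station"])

    # Home: the usual first entry station matches the usual last exit station.
    home = most_common(first_in) if most_common(first_in) == most_common(last_out) else None
    # Work: the usual first exit station matches the usual last entry station.
    work = most_common(first_out) if most_common(first_out) == most_common(last_in) else None
    return home, work
```

A passenger who enters at A and exits at B most mornings, and makes the reverse trip most evenings, would be assigned A as the address station and B as the workplace station; `None` is returned when the two stations being compared disagree.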
According to the address station and the working site station, calculating the distance between the address and the working site:
setting the latitude of the station as alpha, the longitude as beta, taking the center of the earth as an origin, taking the connecting line of the center of the earth and a 0-longitude point on the equator as an X axis, taking the connecting line of the center of the earth and a 90-DEG east longitude point on the equator as a Y axis, and taking the connecting line of the center of the earth and a north pole as a Z axis, establishing a coordinate system, and obtaining the coordinates (X, Y, Z) of the station:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
where R is the radius of the earth, north latitude and east longitude are taken as positive, and south latitude and west longitude are taken as negative;
calculating the straight-line distance L between the address station (x1, y1, z1) and the workplace station (x2, y2, z2):
L=√((x1−x2)²+(y1−y2)²+(z1−z2)²)
converting the straight-line distance L into an arc length C, thereby obtaining the distance between the address and the workplace, with the arcsine expressed in degrees:
C=arcsin(L/(2R))×π×R/90.
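The coordinate conversion, chord length, and arc length above can be checked with a short sketch. The value used for R (a mean earth radius of 6371 km) and the function names are assumptions; the formulas follow the text, with the arcsine taken in degrees so that C = arcsin(L/(2R)) × π × R / 90 equals the great-circle arc 2R·arcsin(L/(2R)) in radians.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius

def station_xyz(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Earth-centered coordinates: X through (0°, 0°), Y through 90°E on the equator, Z through the north pole."""
    alpha = math.radians(lat_deg)  # latitude, north positive
    beta = math.radians(lon_deg)   # longitude, east positive
    return (
        radius * math.cos(alpha) * math.cos(beta),
        radius * math.cos(alpha) * math.sin(beta),
        radius * math.sin(alpha),
    )

def arc_distance(station_a, station_b, radius=EARTH_RADIUS_KM):
    """Arc length between two (lat, lon) stations via the straight-line chord between them."""
    chord = math.dist(station_xyz(*station_a, radius), station_xyz(*station_b, radius))
    # Clamp against floating-point overshoot before taking the arcsine.
    ratio = min(1.0, chord / (2 * radius))
    return math.degrees(math.asin(ratio)) * math.pi * radius / 90
```

As a sanity check, two points on the equator 90° of longitude apart are a quarter of a great circle apart, so the result should be close to π·R/2.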
The results of classifying passenger travel modes according to this embodiment are shown in Table 2:
Table 2 Passenger travel mode classification results
Mode 1: late departure, late return type
Mode 2: conventional type (features differ little from the mean)
Mode 3: outgoing type (average travel days and number of trips are higher than the mean)
Mode 4: early departure, late return type (fewer stations recorded, longer duration per trip)
According to the scheme of the embodiment, the personal preference of the passenger can be roughly judged.
From the perspective of information theory, the lower the frequency of occurrence of an object, the greater the amount of information that is embedded therein.
Let f(P) denote the amount of information corresponding to an event, where P is the probability of that event; the lower the probability, the larger the information content.
The personal preferences of a passenger can therefore be judged from the events corresponding to the less frequent entries in that passenger's travel records. Using spatio-temporal analysis, for example, if a concert is held at time A, passenger X has an exit record at station P near the venue around that time, and the probability of X appearing at that station is lower than X's average probability across stations, it can be inferred that the passenger attended the concert and likes music.
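The expression for f(P) is not reproduced in this text. A standard choice with the stated property is Shannon self-information, f(P) = −log₂ P, and the sketch below assumes that form; the function name is hypothetical.

```python
import math

def information_content(probability):
    """Shannon self-information in bits: rarer events carry more information."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)
```

Under this assumption, a station that accounts for half of a passenger's records carries 1 bit, while a station seen in one record out of eight carries 3 bits, which is why rare station visits are the informative ones for preference labeling.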
The process of judging that a passenger took part in a certain activity and labeling the corresponding preference is as follows:
1. Spatial match:
(a) using a web crawler, crawl the time, place, and preference type of various activities from the Internet; preferences can be preliminarily divided into music, drama, sports, parent-child activities, exhibitions, vocals, and the like;
(b) crawl the longitude and latitude of each subway station and of each activity venue from the Internet; calculate the straight-line distances from the longitudes and latitudes, and record the four subway stations nearest to each activity venue.
2. Person and time match:
(a) before the activity begins, the passenger has an exit transaction record at one of the four stations nearest to the activity venue;
(b) after the activity ends, the passenger has an entry transaction record at one of the four stations nearest to the activity venue;
(c) the passenger has no travel transaction record while the activity is in progress.
3. Final screening of the passengers who satisfy the spatial, temporal, and person conditions:
Determine whether the stations near the activity venue are stations the passenger commonly uses: query the frequency of each station in the passenger's travel records, and mark a station as common if its frequency is higher than the average frequency, and as uncommon otherwise. Passengers for whom the nearby stations are common stations are excluded, and the remaining passengers are labeled with the personal preference.
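The three screening conditions can be sketched as follows. Times are assumed to be numeric (for example, minutes on a shared clock), and the record fields, the `activity` dictionary layout, and the function names are all hypothetical; the crawling step is not modeled.

```python
from collections import Counter

def attended_activity(trips, activity):
    """Check the person-and-time conditions against the stations nearest the venue."""
    near = activity["nearby_stations"]
    # (a) exited at a nearby station before the activity began
    arrived = any(t["exit_station"] in near and t["exit_time"] <= activity["start"] for t in trips)
    # (b) entered at a nearby station after the activity ended
    left = any(t["entry_station"] in near and t["entry_time"] >= activity["end"] for t in trips)
    # (c) no trip overlaps the activity interval
    travelled = any(t["entry_time"] < activity["end"] and t["exit_time"] > activity["start"] for t in trips)
    return arrived and left and not travelled

def is_common_station(station, trips):
    """A station is common if it appears more often than the average station in the records."""
    counts = Counter()
    for t in trips:
        counts.update((t["entry_station"], t["exit_station"]))
    average = sum(counts.values()) / len(counts)
    return counts[station] > average

def label_preference(trips, activity):
    """Return the activity's preference type, or None if the passenger is screened out."""
    if not attended_activity(trips, activity):
        return None
    if any(is_common_station(s, trips) for s in activity["nearby_stations"]):
        return None  # the venue's stations are part of the passenger's routine
    return activity["preference"]
```

The final check mirrors the screening step: a commuter who uses the venue's station every day is excluded, whereas a one-off visit that brackets the activity yields a preference label.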
According to another embodiment of the invention, a subway passenger analysis system based on data mining is provided. As shown in Fig. 2, the system comprises a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel track analysis module.
The data acquisition module acquires source data of subway passenger travel transactions and extracts passenger travel feature data from the source data as clustering variables;
the data processing module performs standardized scaling and normalization on the extracted passenger travel feature data;
the passenger travel mode classification module classifies passenger travel modes by a clustering method and determines the final number of clusters k, where k is the value that minimizes the deviation coefficient, the deviation coefficient being: [formula not preserved in the source text]
where the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
and the passenger travel track analysis module analyzes the travel track of the passenger by a spatio-temporal analysis method.
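The data processing and classification modules can be illustrated with a short Python sketch. It is written under explicit assumptions and is not the patented implementation: the function names are hypothetical, a plain z-score stands in for the standardized scaling (the per-dimension weighting formula is not available in this text), the initial cluster centers are simply passed in by the caller, and the selection of k by the deviation coefficient is left out because its formula is not preserved here.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score every column so that no single feature dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: divide by 1
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def nearest(point, centers):
    """Index of the cluster center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members
        updated = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # empty cluster: leave its center in place
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p, centers) for p in points]
```

For instance, with rows of the form `[average first-entry hour, average trips per week]`, calling `standardize` and then `k_means` with two initial centers separates regular morning commuters from occasional midday travelers.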
In summary, the invention relates to a subway passenger analysis method and system based on data mining, which obtains source data from passenger travel transaction records, processes the source data and analyzes it in multiple dimensions, classifies subway passengers more accurately and effectively, derives the travel tracks of passengers, and provides a reliable data basis for rail transit planning while safeguarding the operational safety of large network passenger flows. The technical scheme of the invention has the following beneficial technical effects:
(1) The final number of clusters in the clustering process is selected by calculating the deviation coefficient. The choice of the number of clusters strongly affects the quality of the final clustering result: too few clusters makes the final cost function large, whereas too many clusters makes the cost function very small but produces too many classes and poor practical results. Selecting the number of clusters through the deviation coefficient allows a suitable number to be chosen quickly according to the characteristics of the data, giving a clustering result that better matches the distribution of passengers.
(2) The cluster variables are subjected to standardized scaling and normalization before data analysis, which improves the standardization of the data to be processed and prevents large differences between the feature ranges of the cluster variables from degrading calculation accuracy and convergence speed.
(3) Obtaining the initial cluster center points from the Gaussian model quickly yields suitable initial cluster centers, and suitable initial cluster centers speed up the fitting of the algorithm, so that the clustering result is more effective and reasonable.
(4) The travel track of the passenger is analyzed by a spatio-temporal analysis method: the address and the work place of the passenger and the approximate distance between them are estimated, the passenger's subway travel is further analyzed, and a reliable data basis is provided for rail transit planning.
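The address and work place estimation mentioned in effect (4) can be sketched in Python. This is a minimal sketch under assumptions, not the patented implementation: the input layout (one list of `(entry_station, exit_station)` pairs per day, in time order) and the function names are hypothetical, and "maximum probability" is read here as the highest relative frequency across days.

```python
from collections import Counter


def most_common(stations):
    """Station with the highest relative frequency among the given stations."""
    return Counter(stations).most_common(1)[0][0]


def infer_home_and_work(days):
    """Estimate (address station, work place station) from daily trip lists.

    days: list of per-day trip lists; each trip is an
    (entry_station, exit_station) pair and each day is in time order.
    """
    days = [d for d in days if d]  # ignore days without any trip
    if not days:
        return None, None
    first_entry = most_common(d[0][0] for d in days)   # first trip, entry
    last_exit = most_common(d[-1][1] for d in days)    # last trip, exit
    first_exit = most_common(d[0][1] for d in days)    # first trip, exit
    # Address station: the day usually starts and ends at the same station
    home = first_entry if first_entry == last_exit else None
    # Work place station: where the first trip of the day usually ends
    work = first_exit
    return home, work
```

For a passenger who usually rides from station A to station B in the morning and back in the evening, the function returns `("A", "B")`; if the last exit of the day usually differs from the first entry, no address station is reported.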
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.
Claims (9)
1. A subway passenger analysis method based on data mining, characterized by comprising the following steps:
acquiring source data of subway passenger travel transactions;
extracting passenger travel characteristic data from the source data as a clustering variable;
carrying out standardized scaling and normalization processing on the clustering variables;
clustering the passenger travel modes by a clustering method, and determining the final number of clusters k to obtain a classification result of the passenger travel modes; the number of clusters k is the value that minimizes the deviation coefficient, the deviation coefficient being: [formula not preserved in the source text]
where the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
and analyzing the travel track of the passenger by a spatio-temporal analysis method.
2. The method of claim 1, wherein the passenger travel feature data extracted from the source data comprises: the average time of the first station entry, the average time of the last station exit, the average number of travel days per month, the average number of trips per week, the average duration of a single trip, and the number of stations recorded by card swipes.
3. The method of claim 2, wherein normalizing the cluster variables comprises the steps of:
for each clustering variable, performing a standardization conversion, the standardized variable being: [formula not preserved in the source text]
calculating the weight of each dimension: [formula not preserved in the source text]
the normalized data being: [formula not preserved in the source text]
where the quantities in the formulas are the data value of the j-th dimension of the i-th passenger, the mean of the j-th dimension of the current data, and the weight of the j-th dimension; W is the vector of the dimension weights and x is the vector of the data values; L is the data dimension and n is the total number of data.
4. The method according to claim 3, wherein clustering the passenger travel modes comprises the steps of:
selecting K points as cluster center points;
for each data point, computing its distance to each of the K cluster center points and associating it with the nearest cluster center point, all points associated with the same cluster center point forming one class: [formula not preserved in the source text]
calculating the coordinate mean of each cluster, and moving the cluster center point associated with the cluster to the position of the mean: μi=(1/m)×(x1+x2+…+xm)
where μi is the center point of the i-th cluster, xc is the c-th data point belonging to the i-th cluster, and m is the number of data points belonging to the i-th cluster;
6. The method according to claim 5, wherein analyzing the travel track of the passenger by a spatio-temporal analysis method comprises the following steps:
estimating the address station and the work place station of the passenger according to the passenger travel feature data extracted from the source data;
and estimating the distance between the address and the work place according to the address station and the work place station.
7. The method of claim 6, wherein estimating the address station and the work place station of the passenger based on the passenger travel feature data extracted from the source data comprises:
calculating, from the passenger travel feature data, the entry station with the highest probability for the first trip of each day, the exit station with the highest probability for the last trip of each day, and the exit station with the highest probability for the first trip of each day;
if the entry station with the highest probability for the first trip of each day is the same as the exit station with the highest probability for the last trip of each day, that station is the address station;
the exit station with the highest probability for the first trip of each day is the work place station.
8. The method of claim 7, wherein estimating the distance between the address and the work place based on the address station and the work place station comprises:
letting the latitude of a station be α and its longitude be β, and establishing a coordinate system with the center of the earth as the origin, the line from the center of the earth to the point at 0° longitude on the equator as the X axis, the line from the center of the earth to the point at 90° east longitude on the equator as the Y axis, and the line from the center of the earth to the north pole as the Z axis, obtaining the coordinates (x, y, z) of the station:
x=R×cosα×cosβ
y=R×cosα×sinβ
z=R×sinα;
wherein R is the radius of the earth; north latitude and east longitude are taken as positive, and south latitude and west longitude as negative;
calculating the straight-line distance L between the address station (x1, y1, z1) and the work place station (x2, y2, z2):
L=√((x1−x2)²+(y1−y2)²+(z1−z2)²);
converting the straight-line distance L into an arc length C, thereby obtaining the distance between the address and the work place:
C=arcsin(L/(2R))×π×R/90, with arcsin expressed in degrees.
9. A subway passenger analysis system based on data mining, characterized by comprising a data acquisition module, a data processing module, a passenger travel mode classification module and a passenger travel track analysis module; wherein:
the data acquisition module acquires source data of subway passenger travel transactions and extracts passenger travel feature data from the source data as clustering variables;
the data processing module performs standardized scaling and normalization on the extracted passenger travel feature data;
the passenger travel mode classification module classifies passenger travel modes by a clustering method and determines the final number of clusters k, where k is the value that minimizes the deviation coefficient, the deviation coefficient being: [formula not preserved in the source text]
where the two quantities in the formula are, respectively, the mean and the standard deviation of the j-th dimension of the i-th cluster center, and L is the data dimension;
the passenger travel track analysis module analyzes the travel track of the passenger by a spatio-temporal analysis method.
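The distance estimation recited in claim 8 (coordinate conversion, straight-line distance, then arc length) can be sketched in Python. The function names and the value of 6371 km for the earth radius R are assumptions made for illustration; the claim itself only names R.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius; the claim only calls it R


def to_xyz(lat_deg, lon_deg, r=EARTH_RADIUS_KM):
    """Station coordinates: x = R cos(a) cos(b), y = R cos(a) sin(b), z = R sin(a).

    North latitude and east longitude are positive, south and west negative.
    """
    a, b = math.radians(lat_deg), math.radians(lon_deg)
    return (r * math.cos(a) * math.cos(b),
            r * math.cos(a) * math.sin(b),
            r * math.sin(a))


def arc_length(chord, r=EARTH_RADIUS_KM):
    """C = arcsin(L / (2R)) * pi * R / 90, with arcsin taken in degrees."""
    ratio = min(1.0, chord / (2 * r))  # guard against rounding just above 1
    return math.degrees(math.asin(ratio)) * math.pi * r / 90


def station_distance(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Surface distance between two stations, in the same unit as r."""
    chord = math.dist(to_xyz(lat1, lon1, r), to_xyz(lat2, lon2, r))
    return arc_length(chord, r)
```

As a sanity check on the formula, two points on the equator separated by 90° of longitude give a chord of R√2, for which arcsin returns 45° and the arc length is πR/2, one quarter of the circumference.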
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110562020.8A CN112988855A (en) | 2021-05-24 | 2021-05-24 | Subway passenger analysis method and system based on data mining |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112988855A (en) | 2021-06-18 |
Family
ID=76337116
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110562020.8A Pending CN112988855A (en) | 2021-05-24 | 2021-05-24 | Subway passenger analysis method and system based on data mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112988855A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105718946A (en) * | 2016-01-20 | 2016-06-29 | 北京工业大学 | Passenger going-out behavior analysis method based on subway card-swiping data |
CN111833229A (en) * | 2020-03-28 | 2020-10-27 | 东南大学 | Travel behavior space-time analysis method and device based on subway dependency |
Non-Patent Citations (1)
Title |
---|
李飞羽: "城市轨道交通乘客行为特征分析及出行预测", 《中国优秀硕士学位论文全文数据库(电子期刊)工程科技Ⅱ辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642625A (en) * | 2021-08-06 | 2021-11-12 | 北京交通大学 | Method and system for deducing individual trip purpose of urban rail transit passenger |
CN113642625B (en) * | 2021-08-06 | 2024-02-02 | 北京交通大学 | Method and system for deducing individual travel purposes of urban rail transit passengers |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109191896B (en) | Personalized parking space recommendation method and system | |
CN111653097B (en) | Urban trip mode comprehensive identification method based on mobile phone signaling data and containing personal attribute correction | |
CN110942198B (en) | Passenger path identification method and system for rail transit operation | |
CN113159364A (en) | Passenger flow prediction method and system for large-scale traffic station | |
CN110555544B (en) | Traffic demand estimation method based on GPS navigation data | |
CN108806248B (en) | Vehicle travel track division method for RFID electronic license plate data | |
CN113436433B (en) | Efficient urban traffic outlier detection method | |
CN107610282A (en) | A kind of bus passenger flow statistical system | |
CN110727714A (en) | Resident travel feature extraction method integrating space-time clustering and support vector machine | |
Li et al. | Using smart card data trimmed by train schedule to analyze metro passenger route choice with synchronous clustering | |
CN111581325A (en) | K-means station area division method based on space-time influence distance | |
Guo et al. | Exploring potential travel demand of customized bus using smartcard data | |
CN112988855A (en) | Subway passenger analysis method and system based on data mining | |
CN108681741B (en) | Subway commuting crowd information fusion method based on IC card and resident survey data | |
CN107730717B (en) | A kind of suspicious card identification method of public transport based on feature extraction | |
CN108053646B (en) | Traffic characteristic obtaining method, traffic characteristic prediction method and traffic characteristic prediction system based on time sensitive characteristics | |
CN113408833A (en) | Public traffic key area identification method and device and electronic equipment | |
CN112733890A (en) | Online vehicle track clustering method considering space-time characteristics | |
CN111292099A (en) | Intelligent station anti-ticket-swiping method and anti-ticket-swiping system | |
CN110610446A (en) | County town classification method based on two-step clustering thought | |
Nagy et al. | Land-use zone estimation in public transport planning with data mining | |
Thiagarajan et al. | Identification of passenger demand in public transport using machine learning | |
Zhang et al. | Research on taxi driver strategy game evolution with carpooling detour | |
CN114742131A (en) | Method for identifying urban excessive tourism area based on pattern mining | |
CN112988849B (en) | Traffic track mode distributed mining method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210618 |