CN112070529A - Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium - Google Patents
- Publication number
- CN112070529A (CN202010857329.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- passenger
- parallel
- svm algorithm
- track data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0202—Market predictions or forecasting for commercial activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/245—Classification techniques relating to the decision surface
- G06F18/2451—Classification techniques relating to the decision surface linear, e.g. hyperplane
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Accounting & Taxation (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Entrepreneurship & Innovation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Fuzzy Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Game Theory and Decision Science (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Remote Sensing (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Primary Health Care (AREA)
- Tourism & Hospitality (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a passenger-carrying hotspot parallel prediction method, system, terminal and computer storage medium. The method comprises the following steps: acquiring movement track data of a vehicle; preprocessing the movement track data to obtain passenger hotspot data; constructing a parallel GS-SVM algorithm according to the passenger hotspot data; and executing the GS-SVM algorithm based on RDD and outputting a prediction result. By preprocessing the movement track data into passenger hotspot data, constructing the parallel GS-SVM algorithm and implementing it with RDDs under Spark, the method uses the ability of the GS-SVM algorithm to capture nonlinear information, which effectively improves prediction accuracy, robustness and real-time performance and solves the technical problems of distributed storage and parallel computation of big movement track data.
Description
Technical Field
The invention relates to the field of passenger hotspot prediction based on big data of a movement track, in particular to a passenger carrying hotspot parallel prediction method, a system, a terminal and a computer storage medium.
Background
In the era of big-data-driven intelligent transportation, passenger hotspot prediction is important to the profitability of drivers (such as taxi drivers). It uses the historical number of boarding passengers at passenger hotspots to predict the passenger hotspots in a future time interval, which helps drivers find passengers quickly, shortens cruising time, and reduces cost, energy consumption and environmental pollution.
Traditional passenger hotspot prediction algorithms suffer from low prediction accuracy and poor adaptability. In particular, with the explosive growth of traffic big data, traditional serial algorithms running on a single-machine centralized computing platform are limited in passenger hotspot prediction and are prone to high memory consumption, high I/O (input/output) overhead, low processing efficiency and poor scalability. Moreover, the prior art considers only linear relationships in passenger hotspot prediction, which results in low prediction accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a passenger-carrying hotspot parallel prediction method, system, terminal and computer storage medium that offer high prediction accuracy, high processing speed and strong scalability, and that effectively solve the technical problems of high memory consumption and long running time in distributed storage and parallel computation of passenger hotspots based on big movement track data.
The technical scheme for solving the technical problems is as follows: a passenger-carrying hotspot parallel prediction method, comprising the following steps:
acquiring moving track data of a vehicle;
preprocessing the movement track data to obtain passenger hot spot data;
constructing a parallel GS-SVM algorithm according to the passenger hot spot data;
and executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
The beneficial effects of the invention are as follows: the movement track data are acquired and preprocessed into passenger hotspot data, a parallel GS-SVM algorithm is constructed and implemented with RDDs under Spark, and the ability of the GS-SVM algorithm to capture nonlinear information is used to output the prediction result. This effectively improves prediction accuracy, robustness and real-time performance, and solves the technical problems of distributed storage and parallel computation of big movement track data.
Further, the preprocessing the movement trajectory data to obtain passenger hotspot data includes:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the movement track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicles, and 1 represents passenger carrying;
according to the track data with the operation state of 1, carrying out grid division on the road network;
and counting the track data with the operation state of 1 in the preset time interval in the grid to obtain the hot spot data of the passengers.
The technical problems of high memory consumption and time consumption in distributed storage and parallel calculation of the big moving track data are solved by sequentially carrying out data extraction, data sorting, road network meshing and data statistics on the big moving track data on the basis of a Spark parallel processing framework under a Hadoop distributed computing platform.
Further, the extracting trajectory data of the vehicle operation state changing from 0 to 1 according to the movement trajectory data and saving trajectory data of which the operation state is 1 includes:
reading the movement track data from the HDFS file and converting it into an RDD (Resilient Distributed Dataset) in Spark;
segmenting the RDD, filtering out invalid data, and then extracting the required fields, which comprise vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID, determining the track data for which the operation states of the same vehicle ID are consecutively 0, 1 and 1, storing the last track data whose operation state is 1, and sorting by time.
The advantage is that the movement track data are screened and sorted through the RDD, so that the passenger hotspot counts can be obtained accurately from the track data.
Further, the mesh division of the road network according to the trajectory data with the operation state of 1 includes:
filtering the track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
The advantage is that the track data are filtered by longitude and latitude and then divided into grids, which facilitates the subsequent counting of passenger hotspot data within the same grid.
Further, the counting the trajectory data of which the operation state is 1 in the preset time interval in the grid to obtain the passenger hotspot data includes:
dividing each day into preset time intervals of a preset duration, and counting the number of passengers boarding in the same grid in each preset time interval.
The advantage is that each day is divided into preset time intervals and the number of boarding passengers in each interval is counted per road network grid, so that the number of boarding passengers at each passenger hotspot can be located accurately and the robustness and accuracy of passenger hotspot prediction are improved.
Further, the constructing of the parallel GS-SVM algorithm according to the passenger hotspot data comprises:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to an SVM algorithm to obtain a GS-SVM algorithm.
The advantage is that the built-in radial basis function parameters of the SVM algorithm are optimized by the grid search method and the GS-SVM algorithm is determined from the optimal parameter combination. The GS-SVM algorithm is then combined with real passenger hotspot data to predict the number of boarding passengers at the passenger hotspots in a future time interval, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy that arises because existing passenger hotspot prediction algorithms consider only linear information and ignore the nonlinear information of passenger hotspots.
Further, the step of applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm includes:
the optimal parameters of the RBF kernel function are penalty factor C = 900 and kernel parameter γ = 0.001, and the GS-SVM algorithm is as follows:
where f(x) is the predicted value returned, y is the corresponding true value, Φ(x) is the nonlinear mapping function, ε is the parameter of the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange factor, and K(x_i, x_j) = Φ(x_i)Φ(x_j), i = 1, 2, ..., l, is the RBF kernel function.
The advantage is that the grid search method optimizes the built-in radial basis function parameters of the SVM algorithm. With C = 900 and γ = 0.001 applied to the RBF kernel function, the prediction accuracy of the SVM algorithm is better than with the other parameter combinations, and the performance of the parallel SVM algorithm is globally optimal. Prediction is then performed by combining the resulting GS-SVM algorithm with the passenger hotspots of the historical and current time intervals, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy that arises because existing passenger hotspot prediction algorithms consider only linear information and ignore the nonlinear information of passenger hotspots, and because nonlinear algorithms are sensitive to their parameters and easily fall into a local optimum.
In order to solve the technical problem, an embodiment of the present invention further provides a system for predicting passenger-carrying hotspots in parallel, including a data acquisition module, a data preprocessing module, an algorithm establishing module, and a prediction module;
the data acquisition module is used for acquiring the moving track data of the vehicle;
the data preprocessing module is used for preprocessing the movement track data to obtain hot spot data of passengers;
the algorithm establishing module is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
and the prediction module is used for executing the GS-SVM algorithm based on the RDD and outputting the prediction result.
In order to solve the above technical problem, an embodiment of the present invention further provides a terminal, where the terminal includes a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the passenger-carrying hot spot parallel prediction method as described above.
To solve the above technical problem, embodiments of the present invention further provide a computer storage medium storing one or more computer programs, which are executable by one or more processors to implement the steps of the passenger hotspot prediction method described above.
Drawings
Fig. 1 is a flowchart of the passenger-carrying hotspot parallel prediction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining passenger hotspot data according to an embodiment of the present invention;
Fig. 3 is a framework diagram of a passenger hotspot parallel prediction implementation provided by an embodiment of the present invention;
Fig. 4 is a functional diagram of distributed storage and parallel computation based on HDFS and Spark, respectively, according to an embodiment of the present invention;
Fig. 5 is a diagram of the HDFS process communication framework according to an embodiment of the present invention;
Fig. 6 is a flowchart of a Spark computing task according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the passenger-carrying hotspot parallel prediction system provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Example 1
The terms used in the present invention are explained in Table 1:
TABLE 1
Apache Hadoop architecture:
Apache Hadoop is a reliable and extensible open source distributed computing architecture that provides a stable and reliable interface for applications in a cluster built from a large amount of inexpensive hardware. It makes full use of the computing and storage capability of the cluster to build a scalable, highly reliable and fault-tolerant architecture for batch processing of big data, and realizes distributed storage and parallel computation of large-scale data.
HDFS and MapReduce are the core components of the Hadoop architecture and are open source implementations based on GFS (Google File System) and Google MapReduce, respectively. Hadoop realizes distributed storage through HDFS and parallel computation through MapReduce: the NameNode and DataNodes provide the HDFS functions, and the JobTracker and TaskTrackers provide the MapReduce functions (see Fig. 4). In addition, Hadoop also includes Hadoop Common, Hadoop YARN, Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez, and ZooKeeper, among others.
Hadoop Distributed File System (HDFS):
HDFS (Hadoop Distributed File System) is a distributed file system that can be deployed on inexpensive hardware to provide high-throughput parallel data access, and it offers high-performance, fault-tolerant and reliable storage of large-scale data. Referring to Fig. 5, HDFS adopts a Master/Slave mode and consists of one NameNode (manager) node, a plurality of DataNode (worker) nodes, and an HDFS Client; communication among the NameNode, DataNode and HDFS Client processes is realized through the RPC mechanism of Hadoop.
Spark parallel programming model:
Spark is a parallel programming model (often called the "Spark parallel processing framework") that can process large-scale datasets and perform parallel computing tasks on Hadoop clusters consisting of hundreds or thousands of servers. Its main idea is the RDD (Resilient Distributed Dataset), which keeps computed data in distributed memory. The Cluster Manager is an external service for acquiring resources on the cluster; a Worker Node is any node in the cluster that can run application code; an Executor is a process launched for an application on a Worker Node, which is responsible for running tasks and keeping data in memory or on disk, and each application has its own independent Executors. After a Task is executed, its result is returned to the Driver.
Referring to Fig. 1, a passenger-carrying hotspot parallel prediction method includes the following steps:
S1: acquiring movement track data of a vehicle;
S2: preprocessing the movement track data to obtain passenger hotspot data;
S3: constructing a parallel GS-SVM algorithm according to the passenger hotspot data;
S4: executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
The advantage is that the movement track data are acquired, the parallel GS-SVM algorithm is constructed and implemented through RDDs on the Spark parallel distributed computing platform, and the prediction result is output, which effectively improves prediction accuracy, robustness and real-time performance and solves the technical problems of high memory consumption and long running time in distributed storage and parallel computation of big movement track data.
In this embodiment, vehicle GPS track data are used. Passenger hotspots are located accurately through road network gridding; with the grid fixed, a sampling time interval is determined and the number of boarding passengers at the passenger hotspot in the same grid is counted for each preset time interval. Using the parallel GS-SVM algorithm, the passenger hotspot of any grid in a future time interval can then be predicted from the passenger hotspots of that grid in the historical and current time intervals.
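The idea of predicting a grid's next interval from its historical and current intervals can be illustrated with a sliding window over the per-interval boarding counts. The sketch below is only an illustration under assumptions: the function name `make_lagged_samples` and the window length `n_lags` are hypothetical and do not appear in the source.

```python
from typing import List, Tuple


def make_lagged_samples(counts: List[int], n_lags: int) -> Tuple[List[List[int]], List[int]]:
    """Turn one grid's per-interval boarding counts into (history -> next) samples.

    n_lags is a hypothetical window length: how many past intervals
    (historical plus current) are used to predict the following one.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    features, targets = [], []
    for t in range(n_lags, len(counts)):
        features.append(counts[t - n_lags:t])  # historical and current intervals
        targets.append(counts[t])              # the next (future) interval
    return features, targets


# Made-up counts for five consecutive 15-minute intervals in one grid.
counts = [3, 5, 4, 6, 8]
X, y = make_lagged_samples(counts, n_lags=2)
# X == [[3, 5], [5, 4], [4, 6]] and y == [4, 6, 8]
```

Each row of `X` could then serve as an input vector to a regression model, with the matching entry of `y` as its target.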
Under a Hadoop distributed computing platform, based on a Spark parallel processing framework, as shown in fig. 2, S2 specifically includes:
S201: extracting, from the movement track data, the track data at which the operation state changes from 0 to 1, and storing the track data whose operation state is 1, wherein 0 represents an empty vehicle and 1 represents carrying passengers;
S202: dividing the road network into grids according to the data whose operation state is 1;
S203: counting the track data whose operation state is 1 within a preset time interval in each grid to obtain the passenger hotspot data.
In the embodiment, the moving track data is subjected to data extraction, data sorting, road network meshing and data statistics in sequence, so that the influence of error data on the precision of the GS-SVM algorithm is reduced.
In this embodiment, in S2, in a Hadoop distributed computing platform, the specific process of S201 based on the Spark parallel processing framework is as follows:
The movement track data are read from the HDFS file and converted into an RDD (Resilient Distributed Dataset) in Spark. The RDD is segmented, invalid data are filtered out, and the required fields are extracted. The records are sorted by vehicle ID, the track data for which the operation states of the same vehicle ID are consecutively 0, 1 and 1 (that is, empty, passenger-carrying, passenger-carrying) are determined, the last track data whose operation state is 1 is stored, and the stored data are sorted by time. For example, suppose the trajectory of the vehicle with ID 1 runs from point A to point B, where A-A1 is empty, A1-A2 is passenger-carrying, A2-A3 is passenger-carrying, and A3-B is empty; then the track data corresponding to A2-A3 is stored.
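The 0, 1, 1 pattern match described above can be sketched in plain Python without a Spark cluster. This is a minimal local illustration, not the Spark implementation: the `TrackPoint` record, the integer time encoding and the function name `extract_pickups` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class TrackPoint(NamedTuple):
    vehicle_id: str
    state: int   # 0 = empty, 1 = carrying passengers
    time: int    # seconds since midnight (assumed encoding)
    lon: float
    lat: float


def extract_pickups(points: List[TrackPoint]) -> List[TrackPoint]:
    """Keep the last record of every consecutive 0, 1, 1 state pattern per vehicle."""
    by_vehicle: Dict[str, List[TrackPoint]] = defaultdict(list)
    for p in points:
        if p.state in (0, 1):                  # drop records with an invalid state
            by_vehicle[p.vehicle_id].append(p)
    pickups = []
    for track in by_vehicle.values():
        track.sort(key=lambda p: p.time)       # order each vehicle's records by time
        for a, b, c in zip(track, track[1:], track[2:]):
            if (a.state, b.state, c.state) == (0, 1, 1):
                pickups.append(c)              # last state-1 record of the pattern
    pickups.sort(key=lambda p: p.time)         # final ordering by time
    return pickups


# Made-up records: vehicle "1" goes empty, carrying, carrying, carrying, empty.
demo = [TrackPoint("1", s, t, 116.30, 39.90) for t, s in enumerate([0, 1, 1, 1, 0])]
demo += [TrackPoint("2", s, t, 116.31, 39.91) for t, s in enumerate([0, 1, 0])]
pickups = extract_pickups(demo)
# pickups holds one record: vehicle "1" at time 2
```

Vehicle "2" contributes nothing because its states 0, 1, 0 never form the 0, 1, 1 pattern.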
In this embodiment, in S2, in the Hadoop distributed computing platform, the specific process of S202 based on the Spark parallel processing framework is as follows:
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
For example, a target latitude range (39.8283918700-39.9909153300) and a target longitude range (116.2611551300-116.4954361600) are selected, and the track data corresponding to A2-A3 are filtered so that any track data outside the target longitude and latitude ranges are discarded; the longitude and latitude of the remaining track data are then divided into grids. The preset grid step length can be adjusted flexibly according to actual requirements; for example, with a preset grid step length of 0.01 degrees, the grid division mainly divides the filtered track data into 10 × 10 grids.
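The filtering and grid division can be sketched as a function that maps a point to a grid cell index or rejects it. The sketch takes the 39.8... range as latitude and the 116.2... range as longitude (an assumption based on the magnitude of the values); the function name `grid_cell` and the (row, column) indexing scheme are hypothetical.

```python
import math
from typing import Optional, Tuple

# Target ranges from the example; which range is latitude and which is
# longitude is an assumption based on the values.
LAT_MIN, LAT_MAX = 39.82839187, 39.99091533
LON_MIN, LON_MAX = 116.26115513, 116.49543616
STEP = 0.01  # preset grid step length in degrees


def grid_cell(lon: float, lat: float, step: float = STEP) -> Optional[Tuple[int, int]]:
    """Return the (row, col) grid index of a point, or None if it is out of range."""
    if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
        return None                              # filtered out
    row = math.floor((lat - LAT_MIN) / step)     # index along latitude
    col = math.floor((lon - LON_MIN) / step)     # index along longitude
    return row, col


inside = grid_cell(116.30, 39.90)    # (7, 3)
outside = grid_cell(117.00, 39.90)   # None: longitude outside the target range
```

Points that fall in the same cell share one index pair, which is what the later per-grid counting relies on.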
In S2, under the Hadoop distributed computing platform and based on the Spark parallel processing framework, the specific process of S203 is as follows:
dividing each day into preset time intervals, and counting the number of passengers boarding in the same grid in each preset time interval.
The number of boarding passengers can be determined from the changes of the vehicle GPS operation state from 0 to 1. For example, the day is divided into 15-minute intervals and the boarding passengers in the same grid are counted for every interval, such as 8:00-8:15, 8:15-8:30, and so on. Assuming that the track data corresponding to A2-A3 and the track data corresponding to C1-C2 are located in the same grid, the number of boarding passengers given by the track data corresponding to A2-A3 and C1-C2 is counted for every 15-minute interval of the day in that grid, and the number of boarding passengers in each grid is determined in the same way.
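The per-grid, per-interval counting can be sketched with a counter keyed by (grid cell, time slot). The input format, a grid index paired with seconds since midnight, and the function name `count_pickups` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

Cell = Tuple[int, int]


def count_pickups(pickups: Iterable[Tuple[Cell, int]], interval_min: int = 15) -> Counter:
    """Count pickups per (grid cell, time slot).

    Each pickup is (cell, seconds since midnight); slot 0 covers the first
    interval_min minutes of the day, slot 1 the next, and so on.
    """
    counts: Counter = Counter()
    for cell, seconds in pickups:
        slot = seconds // (interval_min * 60)
        counts[(cell, slot)] += 1
    return counts


# Made-up pickups: 8:00:00, 8:14:59 and 8:15:00 in cell (7, 3); 8:00:00 in cell (2, 1).
demo_pickups = [((7, 3), 28800), ((7, 3), 29699), ((7, 3), 29700), ((2, 1), 28800)]
hotspots = count_pickups(demo_pickups)
# hotspots[((7, 3), 32)] == 2  (8:00-8:15)
# hotspots[((7, 3), 33)] == 1  (8:15-8:30)
```

With 15-minute intervals, slot 32 corresponds to 8:00-8:15 because 8 hours contain 32 quarter-hours.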
In this embodiment, S3 specifically includes:
the default parameters of the kernel function built into the SVM algorithm are not optimal, so the optimal parameter combination of the RBF kernel function is searched for using a grid search method;
and the optimal parameter combination is applied to the SVM algorithm to obtain the GS-SVM algorithm.
First, the number of boarding passengers at each passenger hotspot is located accurately through road network gridding, which improves the robustness and accuracy of passenger hotspot prediction. Second, the grid search method optimizes the built-in radial basis function parameters of the SVM algorithm, and the GS-SVM algorithm is then combined with real passenger hotspot data to predict the number of boarding passengers at the passenger hotspots in a future time interval, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy that arises because existing passenger hotspot prediction algorithms consider only linear information and ignore nonlinear information, and because nonlinear algorithms are sensitive to their parameters and easily fall into a local optimum.
The SVM algorithm, also called the support vector machine, has clear advantages in handling nonlinear problems: by mapping the data into a high-dimensional space, it transforms a low-dimensional nonlinear problem into a linear one. When the SVM algorithm is used for regression prediction, the linear regression function is as follows:
f(x)=wΦ(x)+b (1)
where Φ (x) is the nonlinear mapping function, w is the weight vector, and b is the classification threshold.
A linear insensitive loss function with parameter ε is defined,
where f(x) is the predicted value returned by the regression function and y is the corresponding true value.
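In the standard ε-SVR formulation, which uses the same symbols f(x), y and ε, the linear insensitive loss is usually written as below. The numbering (2) is inferred from the position of the formula between formula (1) and the later reference to formula (4).

```latex
L_\varepsilon\bigl(f(x), y\bigr) =
\begin{cases}
0, & |y - f(x)| \le \varepsilon \\
|y - f(x)| - \varepsilon, & |y - f(x)| > \varepsilon
\end{cases}
\tag{2}
```

Errors smaller than ε cost nothing, and larger errors are penalized linearly, which is why the loss is called linear insensitive.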
To determine the values of w and b, the relaxation variables ξ_i and ξ_i^* are introduced,
where C is a penalty factor, and the larger C is, the heavier the penalty on samples with training error; ε specifies the error requirement of the regression function, and a smaller ε means a smaller error of the regression function.
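With the relaxation variables, penalty factor C and error requirement ε just described, the standard ε-SVR primal problem takes the following form. It is given as a sketch consistent with the symbols above, with the numbering (3) inferred.

```latex
\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)
\quad \text{s.t.} \quad
\begin{cases}
y_i - w\Phi(x_i) - b \le \varepsilon + \xi_i \\
w\Phi(x_i) + b - y_i \le \varepsilon + \xi_i^* \\
\xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1, 2, \ldots, l
\end{cases}
\tag{3}
```

The first term keeps the regression function flat, and the second term charges C for every unit by which a training sample leaves the ε-tube.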
The Lagrange function is introduced as follows to convert the problem into its dual form for solving,
where K(x_i, x_j) = Φ(x_i)Φ(x_j) is the RBF kernel function and α_i is the Lagrange factor.
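For reference, the standard dual of the ε-SVR problem, written with the kernel K(x_i, x_j) and the Lagrange factors α_i and α_i^* named above, is shown below. The tag (4) is inferred from the later reference to formula (4).

```latex
\max_{\alpha,\,\alpha^*}\
-\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)
- \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*)
+ \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*)
\quad \text{s.t.} \quad
\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0,\quad 0 \le \alpha_i,\ \alpha_i^* \le C
\tag{4}
```

Only inner products Φ(x_i)Φ(x_j) appear in the dual, so the mapping Φ never has to be evaluated explicitly; the kernel value is enough.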
In this embodiment, a grid search method is used to find the optimal combination of the penalty factor C and the kernel parameter γ of the RBF kernel function. Specifically, the grid search traverses all grid points to find the best combination of radial basis function parameters, and cross validation is used to select the globally optimal combination, which effectively prevents the SVM algorithm from falling into a local optimum. In this embodiment, the candidates are defined as C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009].
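The traversal of all (C, γ) grid points with cross validation can be sketched as follows. The sketch is model-agnostic: `fold_error` is a hypothetical callable that would train an SVM on the training indices and return its validation error, and the toy error surface used in the example is artificial, not a result from the source. The candidate list for C (100 to 900) is read from a garbled passage and should be treated as an assumption.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(n: int, k: int) -> List[Split]:
    """Split sample indices 0..n-1 into k (train, validation) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, folds[i]))
    return splits


def grid_search(c_values: Sequence[float], gamma_values: Sequence[float],
                fold_error: Callable[[float, float, List[int], List[int]], float],
                n_samples: int, k: int = 5) -> Tuple[Tuple[float, float], float]:
    """Return the (C, gamma) pair with the lowest mean cross-validation error."""
    splits = k_fold_indices(n_samples, k)
    best_params, best_err = None, float("inf")
    for c, gamma in product(c_values, gamma_values):   # traverse every grid point
        err = sum(fold_error(c, gamma, tr, va) for tr, va in splits) / k
        if err < best_err:
            best_params, best_err = (c, gamma), err
    return best_params, best_err


C_VALUES = [100, 300, 500, 700, 900]
GAMMA_VALUES = [0.001, 0.003, 0.005, 0.007, 0.009]


def toy_error(c, gamma, train, valid):
    """Artificial error surface with its minimum at C=500, gamma=0.005."""
    return abs(c - 500) + 1000 * abs(gamma - 0.005)


best, err = grid_search(C_VALUES, GAMMA_VALUES, toy_error, n_samples=20)
# best == (500, 0.005) for this artificial error surface
```

Averaging the error over all folds before comparing grid points is what makes the selection depend on the whole training set rather than on one lucky split.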
From formula (4), α = [α_1, α_2, …, α_l] and α* = [α_1*, α_2*, …, α_l*] are obtained, and then:
where SV is the set of support vectors and N_nsv is the number of support vectors.
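For orientation, the textbook ε-insensitive support vector regression formulation that uses the same symbols (w, b, Φ, C, ε, ξ_i, ξ_i^*, α_i, α_i^*, K) is sketched below. It is the standard form, given here as an assumption about the kind of expressions the numbered formulas contain; the patent's own equations may differ in detail.

```latex
% epsilon-insensitive loss
L_\varepsilon\bigl(f(x), y\bigr) = \max\bigl(0,\ |y - f(x)| - \varepsilon\bigr)

% primal problem with slack variables
\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\lVert w\rVert^2
  + C\sum_{i=1}^{l}\bigl(\xi_i + \xi_i^*\bigr)
\quad\text{s.t.}\quad
\begin{cases}
y_i - w\cdot\Phi(x_i) - b \le \varepsilon + \xi_i,\\
w\cdot\Phi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\\
\xi_i,\ \xi_i^* \ge 0,\quad i = 1,\dots,l
\end{cases}

% dual problem
\max_{\alpha,\,\alpha^*}\
  -\tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}
     (\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\,K(x_i,x_j)
  - \varepsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*)
  + \sum_{i=1}^{l} y_i(\alpha_i-\alpha_i^*)
\quad\text{s.t.}\quad
\sum_{i=1}^{l}(\alpha_i-\alpha_i^*) = 0,\qquad
0 \le \alpha_i,\ \alpha_i^* \le C

% resulting regression function, with the RBF kernel
f(x) = \sum_{x_i \in SV}(\alpha_i-\alpha_i^*)\,K(x_i,x) + b,
\qquad
K(x_i,x_j) = \exp\bigl(-\gamma\,\lVert x_i-x_j\rVert^2\bigr)
```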
Combining formulas (1) and (5) and applying the grid search method gives the optimal parameter combination C = 900 and γ = 0.001, and the optimized GS-SVM algorithm is obtained as follows:
In this embodiment, applying the optimal parameter combination C = 900 and γ = 0.001 to the RBF kernel function gives the SVM algorithm better prediction accuracy than the other parameter combinations.
In this embodiment, the GS-SVM algorithm's ability to capture nonlinear information is combined with historical data to predict the number of boarding passengers in a preset time interval. For example, the number of passengers boarding in a grid during the current 15 minutes is used to predict the number of boardings in the same grid during the next 15-minute interval.
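One way to turn a grid's series of 15-minute boarding counts into training pairs for such a predictor is sketched below. The function name `make_supervised` and the `n_lags` parameter are illustrative assumptions; the text only states that the current interval is used to predict the next one, which corresponds to `n_lags=1`.

```python
def make_supervised(counts, n_lags=1):
    """Turn a per-grid series of 15-minute boarding counts into (X, y) pairs.

    Each row of X holds the n_lags most recent counts, and the matching
    entry of y is the count of the interval that follows them.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    X, y = [], []
    for t in range(n_lags, len(counts)):
        X.append(list(counts[t - n_lags:t]))
        y.append(counts[t])
    return X, y


if __name__ == "__main__":
    # Four consecutive 15-minute counts for one grid cell (made-up numbers).
    print(make_supervised([3, 5, 4, 7], n_lags=1))
```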
In the present embodiment, S4 includes:
Step 1: The optimal combination of the penalty factor C and the kernel parameter γ of the SVM algorithm's RBF function is found by a Spark parallel grid search, so that the SVM algorithm generalizes well.
Step 2: The optimal parameter combination obtained in step 1 is applied to the RBF (radial basis function) kernel of the SVM (support vector machine) algorithm to obtain the Spark-based parallel GS-SVM algorithm.
Step 3: The Spark-based parallel GS-SVM algorithm, with its ability to capture nonlinear information, is used to predict passenger hotspots in the same grid in a future time interval.
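Step 1 parallelizes well because each (C, γ) pair can be scored independently. The sketch below shows that map-then-reduce shape with a thread pool from the Python standard library standing in for Spark executors; it is not Spark code. `cv_error` is a hypothetical callback that would return the cross-validated error of an RBF-kernel SVM for the given parameters.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def parallel_grid_search(cv_error, c_values, gamma_values, max_workers=4):
    """Score every (C, gamma) pair concurrently and return the best one.

    cv_error(C, gamma) is a hypothetical callback returning the
    cross-validated error of an RBF-kernel SVM with those parameters.
    """
    grid = list(product(c_values, gamma_values))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # "Map" step: one independent task per grid point.
        errors = list(pool.map(lambda params: cv_error(*params), grid))
    # "Reduce" step: keep the grid point with the smallest error.
    best_error, best_params = min(zip(errors, grid))
    return best_params, best_error


if __name__ == "__main__":
    # Synthetic error surface for demonstration only.
    demo = lambda C, gamma: (C - 700) ** 2 + (gamma - 0.005) ** 2
    print(parallel_grid_search(demo,
                               [100, 300, 500, 700, 900],
                               [0.001, 0.003, 0.005, 0.007, 0.009]))
```

In a Spark job the same structure would typically be expressed as a distributed collection of parameter pairs that is mapped to errors and then reduced to the minimum.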
As shown in fig. 6, the vehicle GPS data is converted into a resilient distributed dataset (RDD) in Spark. The grid search ranges C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009] are determined through extensive experiments. Spark parallel grid search traverses all grid points to find the optimal combination of the penalty factor C and the kernel parameter γ of the SVM algorithm's RBF function, which yields the GS-SVM algorithm. The numbers of boarding passengers in the historical and current time intervals are then input into the GS-SVM algorithm, which predicts the number of boardings at passenger hotspots in the same grid in the future time interval.
Under the Hadoop distributed computing platform, the algorithm is implemented on the Spark parallel processing framework in combination with RDDs, which solves the technical problems of distributed storage and parallel computation of movement trajectory big data. The optimal parameter combination found by the grid search method is applied to the SVM algorithm, and prediction combines the passenger hotspots of historical and current time intervals, which improves the real-time performance and scalability of passenger hotspot prediction. This addresses the technical problem of low prediction accuracy in existing passenger hotspot prediction, which considers only linear information and ignores nonlinear information, or does not fully account for nonlinear algorithms being sensitive to parameters and prone to local optima.
The implementation principle of this embodiment is as follows (see fig. 3). First, data acquisition: GPS track data generated by vehicles is collected to obtain movement trajectory big data. Second, data preprocessing: a passenger hotspot sequence is obtained through data extraction, data sorting, road network gridding and data statistics. Specifically, on the Spark parallel distributed computing platform, the data is first converted into a resilient distributed dataset (RDD) in Spark; the RDD is then sliced, invalid data is filtered out, the required fields (such as vehicle ID, operation state, time, longitude and latitude) are extracted, and the records are sorted by vehicle ID. Next, tracks with the same vehicle ID and consecutive operation states of 0 (empty), 1 (carrying passengers) and 1 (carrying passengers) are found, each of which represents a passenger boarding event; the vehicle ID, operation state, time, longitude and latitude of the third record, whose operation state is 1, are retained and then sorted by time. Finally, a longitude and latitude range is selected, the stored track data is filtered to that range, the longitudes and latitudes of the filtered track data are divided into grids, the day is divided into a time series with 15-minute intervals, and the number of boarding passengers in each grid is counted for each 15-minute interval. This addresses the distributed storage and parallel computation of large-scale movement track data. Third, data modeling: a parallel GS-SVM algorithm is constructed on the Spark parallel processing framework to improve the robustness and accuracy of passenger hotspot prediction.
The algorithm uses a grid search method to optimize the radial basis function (RBF) of the support vector machine (SVM), uses cross-validation to find the globally optimal parameter combination, and combines the boarding counts of historical time intervals to predict the boarding counts at passenger hotspots in future time intervals. Finally, algorithm implementation: on the Spark parallel processing framework, the GS-SVM algorithm is parallelized in combination with RDDs, which improves the real-time performance and scalability of passenger hotspot prediction.
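The boarding-event rule in the preprocessing step (same vehicle, consecutive operation states 0, 1, 1, keep the third record) can be sketched on plain Python tuples as follows. The record layout `(vehicle_id, status, timestamp, lon, lat)` and the choice to order each vehicle's records by time before scanning are assumptions made for illustration; a Spark job would apply the same logic per vehicle group.

```python
from itertools import groupby
from operator import itemgetter


def boarding_events(records):
    """Return the third record of every 0, 1, 1 status run, per vehicle.

    Each record is a tuple (vehicle_id, status, timestamp, lon, lat);
    this field layout is an assumption for illustration.
    """
    events = []
    # Sort by vehicle, then time, so each vehicle's track is contiguous.
    ordered = sorted(records, key=itemgetter(0, 2))
    for _, track in groupby(ordered, key=itemgetter(0)):
        track = list(track)
        # Slide a window of three consecutive records over the track.
        for first, second, third in zip(track, track[1:], track[2:]):
            if (first[1], second[1], third[1]) == (0, 1, 1):
                events.append(third)
    # The retained records are then sorted by time.
    events.sort(key=itemgetter(2))
    return events


if __name__ == "__main__":
    sample = [
        ("A", 0, 1, 106.2, 26.1), ("A", 1, 2, 106.2, 26.1),
        ("A", 1, 3, 106.3, 26.1), ("A", 1, 4, 106.3, 26.2),
    ]
    print(boarding_events(sample))
```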
Example 2
The present embodiment provides a terminal, as shown in fig. 7, which includes a processor 701, a memory 702, and a communication bus 703;
the communication bus 703 is used for realizing connection communication between the processor 701 and the memory 702;
the processor 701 is configured to execute one or more computer programs stored in the memory 702 to implement the steps of the passenger-carrying hot spot parallel prediction method in the foregoing embodiments, which are not described herein again.
On the basis of embodiment 1, this embodiment provides a passenger-carrying hotspot parallel prediction system, as shown in fig. 8, including a data acquisition module 801, a data preprocessing module 802, an algorithm establishing module 803, and a prediction module 804;
the data acquisition module 801 is used for acquiring movement track data of a vehicle;
the data preprocessing module 802 is configured to preprocess the movement trajectory data to obtain hot spot data of the passenger;
the algorithm establishing module 803 is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
the prediction module 804 is configured to execute a GS-SVM algorithm based on the RDD and output a prediction result.
The beneficial effect is that a parallel GS-SVM algorithm is constructed from the acquired movement trajectory big data and implemented with RDDs under the Spark framework to output the prediction result, which effectively improves prediction accuracy and solves the technical problems of distributed storage and parallel computation of movement trajectory big data.
The data preprocessing module 802 is specifically configured to:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the moving track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicle and 1 represents passenger carrying;
carrying out mesh division on the road network according to the track data with the operation state of 1;
and counting the track data with the operation state of 1 in a preset time interval in the grid to obtain the hot spot data of the passengers.
The movement track data undergoes data extraction, data sorting, road network gridding and data statistics in sequence, which reduces the influence of erroneous data on the SVM algorithm.
Under the Hadoop distributed computing platform, the data preprocessing module obtains the passenger hotspots based on the Spark parallel processing framework as follows:
reading the movement track data from the HDFS file and converting it into a resilient distributed dataset (RDD) in Spark;
slicing the RDD, filtering invalid data, and extracting the required fields, which comprise vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID; identifying track data with the same vehicle ID and consecutive operation states of 0, 1 and 1; storing the last track record with operation state 1; and sorting by time;
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length;
and dividing the time of day according to a preset time interval, and counting the number of passengers getting on the same grid in the preset time interval.
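The filtering, gridding and counting steps listed above can be sketched as a single pass over the boarding events. The parameter names (`lon_range`, `lat_range`, `step`, `slot_minutes`) and the event layout `(minute_of_day, lon, lat)` are illustrative assumptions; the text only specifies a target longitude/latitude range, a preset grid step and a preset time interval.

```python
from collections import Counter
from math import floor


def count_boardings(events, lon_range, lat_range, step, slot_minutes=15):
    """Count boarding events per (grid_col, grid_row, time_slot).

    events: iterable of (minute_of_day, lon, lat) tuples.
    lon_range / lat_range: (min, max) bounds of the study area.
    step: grid cell size in degrees.
    All names here are illustrative assumptions.
    """
    counts = Counter()
    for minute, lon, lat in events:
        # Drop points outside the selected longitude/latitude window.
        if not (lon_range[0] <= lon < lon_range[1]
                and lat_range[0] <= lat < lat_range[1]):
            continue
        col = floor((lon - lon_range[0]) / step)
        row = floor((lat - lat_range[0]) / step)
        slot = minute // slot_minutes  # 0..95 when slot_minutes is 15
        counts[(col, row, slot)] += 1
    return counts


if __name__ == "__main__":
    demo = [(5, 106.2, 26.1), (14, 106.3, 26.2), (20, 106.7, 26.1)]
    print(count_boardings(demo, (106.0, 107.0), (26.0, 27.0), step=0.5))
```

Each key of the returned counter identifies one grid cell in one time slot, so the per-grid time series needed for prediction can be read off by fixing the first two components of the key.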
Further, the specific process of the algorithm establishing module 803 establishing the parallel GS-SVM algorithm according to the training data is as follows:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm.
The Spark-based parallel GS-SVM algorithm, with its ability to capture nonlinear information, is used to predict passenger hotspots in the same grid for the next 15-minute time interval.
First, road network gridding precisely locates the number of boarding passengers at each passenger hotspot, which improves the robustness and accuracy of passenger hotspot prediction. Second, the built-in radial basis function parameters of the SVM algorithm are optimized by a grid search method, and the number of boarding passengers at a hotspot in a future time interval is predicted in combination with real passenger hotspot data, which improves the real-time performance and scalability of passenger hotspot prediction. This addresses the technical problem of low prediction accuracy in existing passenger hotspot prediction methods, which consider only linear information and ignore nonlinear information, or do not fully account for nonlinear algorithms being sensitive to parameters and prone to local optima.
The algorithm establishing module 803 applies the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm, which includes:
based on extensive experiments, C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009] are defined, and the SVM algorithm performance reaches the global optimum when the parameter combination is C = 900 and γ = 0.001.
The GS-SVM algorithm is as follows:
where f(x) is the returned predicted value, y is the corresponding true value, Φ(x) is the nonlinear mapping function, the loss function is the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange multiplier, K(x_i, x_j) = Φ(x_i)·Φ(x_j) is the RBF kernel function, and i = 1, 2, …, l.
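To make the roles of the kernel, the support vectors and the multipliers concrete, the sketch below evaluates a trained RBF-kernel regressor at a new point. It assumes the common parameterization K(x, z) = exp(-γ‖x − z‖²) and the standard support vector expansion f(x) = Σ (α_i − α_i^*) K(x_i, x) + b; the function and argument names are illustrative, and the patent's exact expression may differ.

```python
from math import exp


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Evaluate f(x) = sum_i coef_i * K(sv_i, x) + b.

    dual_coefs[i] stands for (alpha_i - alpha_i^*) of support vector i.
    """
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + b


if __name__ == "__main__":
    # Made-up support vectors and coefficients, for illustration only.
    print(svr_predict([4.0], [[3.0], [6.0]], [1.5, -0.5], b=2.0, gamma=0.001))
```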
An embodiment of the present invention further provides a computer storage medium storing one or more computer programs, which may be executed by one or more processors to implement the steps of the passenger-carrying hotspot parallel prediction method in the above embodiments; these are not described again here.
The beneficial effect of this further scheme is that a grid search method optimizes the built-in radial basis function parameters of the SVM algorithm, and prediction combines historical passenger hotspots with those of the current time interval, which improves the real-time performance and scalability of passenger hotspot prediction. This addresses the technical problem of low prediction accuracy caused by existing algorithms that consider only linear information and ignore nonlinear information, or that do not fully account for nonlinear algorithms being sensitive to parameters and prone to local optima.
The technical solutions provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained in this patent by applying specific examples, and the descriptions of the embodiments above are only used to help understanding the principles of the embodiments of the present invention; also, while the present invention has been described with respect to particular embodiments and with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing descriptions of the present invention are provided for illustration and not for the purpose of limiting the invention as defined by the appended claims.
Claims (10)
1. A passenger carrying hotspot parallel prediction method is characterized by comprising the following steps:
acquiring moving track data of a vehicle;
preprocessing the movement track data to obtain passenger hot spot data;
constructing a parallel GS-SVM algorithm according to the passenger hot spot data;
and executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
2. The passenger-carrying hotspot parallel prediction method of claim 1, wherein the preprocessing the movement trajectory data to obtain passenger hotspot data comprises:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the movement track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicles, and 1 represents passenger carrying;
according to the track data with the operation state of 1, carrying out grid division on the road network;
and counting the track data with the operation state of 1 in the preset time interval in the grid to obtain the hot spot data of the passengers.
3. The passenger-carrying hotspot parallel prediction method according to claim 2, wherein the extracting trajectory data of the vehicle operation state from 0 to 1 according to the movement trajectory data and saving trajectory data of which the operation state is 1 comprises:
reading the movement track data from the HDFS file and converting it into a resilient distributed dataset (RDD) in Spark;
slicing the RDD, filtering invalid data and then extracting the required fields, which comprise vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID; determining track data with the same vehicle ID and consecutive operation states of 0, 1 and 1; storing the vehicle ID, operation state, time, longitude and latitude of the last record with operation state 1; and sorting by time.
4. The method according to claim 2, wherein the grid division of the road network according to the trajectory data with the operation state of 1 comprises:
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
5. The passenger-carrying hotspot parallel prediction method of claim 4, wherein the counting of trajectory data with an operation state of 1 in a preset time interval in the grid comprises:
and dividing the time of day according to the preset time interval, and counting the number of passengers getting on the same grid in the preset time interval.
6. The passenger hot spot parallel prediction method according to any one of claims 1-5, wherein the constructing a parallel GS-SVM algorithm according to the passenger hot spot data comprises:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to an SVM algorithm to obtain a GS-SVM algorithm.
7. The method according to claim 6, wherein the applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm comprises:
the optimal parameters of the RBF kernel function are penalty factor C = 900 and kernel parameter γ = 0.001; the GS-SVM algorithm is as follows:
where f(x) is the returned predicted value, y is the corresponding true value, Φ(x) is the nonlinear mapping function, the loss function is the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange multiplier, K(x_i, x_j) = Φ(x_i)·Φ(x_j) is the RBF kernel function, and i = 1, 2, …, l.
8. A passenger-carrying hotspot parallel prediction system is characterized by comprising a data acquisition module, a data preprocessing module, an algorithm establishing module and a prediction module;
the data acquisition module is used for acquiring the moving track data of the vehicle;
the data preprocessing module is used for preprocessing the movement track data to obtain hot spot data of passengers;
the algorithm establishing module is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
and the prediction module is used for executing the GS-SVM algorithm based on the RDD and outputting a prediction result.
9. A terminal, characterized in that the terminal comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the passenger-carrying hot spot parallel prediction method according to any one of claims 1 to 7.
10. A computer storage medium storing one or more computer programs executable by one or more processors to implement the steps of the passenger-carrying hotspot parallel prediction method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010857329.5A CN112070529A (en) | 2020-08-24 | 2020-08-24 | Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112070529A (en) | 2020-12-11 |
Family
ID=73660292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010857329.5A Pending CN112070529A (en) | 2020-08-24 | 2020-08-24 | Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112070529A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||