CN112070529A - Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium - Google Patents


Info

Publication number
CN112070529A
Authority
CN
China
Prior art keywords
data
passenger
parallel
svm algorithm
track data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010857329.5A
Other languages
Chinese (zh)
Inventor
夏大文
郑永玲
白宇
蒋顺英
杨楠
李华青
高晓楠
冯夫健
严晓波
魏嘉银
张乾
梁燕军
王林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Minzu University
Original Assignee
Guizhou Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Minzu University
Priority to CN202010857329.5A
Publication of CN112070529A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/245 Classification techniques relating to the decision surface
    • G06F18/2451 Classification techniques relating to the decision surface linear, e.g. hyperplane
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a passenger-carrying hotspot parallel prediction method, system, terminal and computer storage medium. The method comprises the following steps: acquiring movement track data of a vehicle; preprocessing the movement track data to obtain passenger hotspot data; constructing a parallel GS-SVM algorithm according to the passenger hotspot data; and executing the GS-SVM algorithm based on RDD and outputting a prediction result. By obtaining passenger hotspot data from the preprocessed movement track data, constructing the parallel GS-SVM algorithm, and implementing it with RDD under Spark, the method uses the ability of the GS-SVM algorithm to capture nonlinear information to output the prediction result. This effectively improves prediction accuracy, robustness and real-time performance, and solves the technical problems of distributed storage and parallel computing of movement track big data.

Description

Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium
Technical Field
The invention relates to the field of passenger hotspot prediction based on movement track big data, and in particular to a passenger-carrying hotspot parallel prediction method, system, terminal and computer storage medium.
Background
In the era of big-data-driven intelligent transportation, passenger hotspot prediction is important for the profitability of drivers (such as taxi drivers). Passenger hotspots in a future time interval are predicted from the number of passengers who boarded at historical passenger hotspots, which helps drivers find passengers quickly, shortens cruising time, lowers cost, and reduces energy consumption and environmental pollution.
Traditional passenger hotspot prediction algorithms suffer from low prediction accuracy and poor adaptability. In particular, with the explosive growth of traffic big data, traditional serial algorithms running on a single-machine centralized computing platform are limited in passenger hotspot prediction and are prone to high memory consumption, high I/O (input/output) overhead, low processing efficiency and poor scalability. Moreover, the prior art considers only linear relationships in passenger hotspot prediction, which leads to low prediction accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a passenger-carrying hotspot parallel prediction method, system, terminal and computer storage medium that offer high prediction accuracy, high processing speed and strong scalability, and that can effectively solve the technical problems of high memory consumption and time consumption in distributed storage and parallel computing of passenger hotspots based on movement track big data.
The technical scheme for solving the above technical problems is as follows: a passenger-carrying hotspot parallel prediction method comprising the following steps:
acquiring moving track data of a vehicle;
preprocessing the movement track data to obtain passenger hot spot data;
constructing a parallel GS-SVM algorithm according to the passenger hot spot data;
and executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
The invention has the following beneficial effects: passenger hotspot data are obtained by preprocessing the acquired movement track data, a parallel GS-SVM algorithm is constructed and implemented with RDD under Spark, and the ability of the GS-SVM algorithm to capture nonlinear information is used to output the prediction result. This effectively improves prediction accuracy, robustness and real-time performance, and solves the technical problems of distributed storage and parallel computing of movement track big data.
Further, the preprocessing the movement trajectory data to obtain passenger hotspot data includes:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the movement track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicles, and 1 represents passenger carrying;
according to the track data with the operation state of 1, carrying out grid division on the road network;
and counting the track data with the operation state of 1 in the preset time interval in the grid to obtain the hot spot data of the passengers.
On a Hadoop distributed computing platform with the Spark parallel processing framework, data extraction, data sorting, road network gridding and data statistics are performed in sequence on the movement track big data, which solves the technical problems of high memory consumption and time consumption in distributed storage and parallel computing of movement track big data.
Further, the extracting trajectory data of the vehicle operation state changing from 0 to 1 according to the movement trajectory data and saving trajectory data of which the operation state is 1 includes:
reading the movement track data from an HDFS file and converting them into a resilient distributed dataset (RDD) in Spark;
splitting the RDD, filtering out invalid data, and extracting the required fields, which comprise vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID, identifying the track data in which the operation states of the same vehicle ID are consecutively 0, 1 and 1, storing the last track record whose operation state is 1, and sorting the stored records by time.
The beneficial effect is that the movement track data are filtered and sorted through the RDD, so that the number of passengers boarding at passenger hotspots can be obtained accurately from the track data.
Further, the mesh division of the road network according to the trajectory data with the operation state of 1 includes:
filtering the track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
The beneficial effect is that the track data are filtered by longitude and latitude and then divided into grids, which facilitates the subsequent counting of passenger hotspot data within the same grid.
Further, the counting the trajectory data of which the operation state is 1 in the preset time interval in the grid to obtain the passenger hotspot data includes:
dividing each day into preset time intervals of a preset duration, and counting the number of passengers boarding in the same grid within each preset time interval.
The beneficial effect is that each day is divided into preset time intervals and, through road network gridding, the number of passengers boarding within each preset time interval is counted, so that the number of passengers boarding at a passenger hotspot can be located accurately and the robustness and accuracy of passenger hotspot prediction are improved.
Further, the constructing of the parallel GS-SVM algorithm according to the passenger hotspot data comprises:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to an SVM algorithm to obtain a GS-SVM algorithm.
The beneficial effect is that the built-in radial basis function parameters of the SVM algorithm are optimized by the grid search method, and the GS-SVM algorithm is determined by the optimal parameter combination. The GS-SVM algorithm is then combined with real passenger hotspot data to predict the number of passengers boarding at passenger hotspots in a future time interval, which improves the real-time performance and scalability of passenger hotspot prediction and solves the technical problem that existing passenger hotspot prediction algorithms consider only linear information and ignore the nonlinear characteristics of passenger hotspots, resulting in low prediction accuracy.
Further, the step of applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm includes:
the optimal parameter combination is penalty factor C = 900 and RBF kernel parameter γ = 0.001, and the GS-SVM algorithm is as follows:
Figure BDA0002646810510000041
Figure BDA0002646810510000042
Figure BDA0002646810510000043
where f(x) is the predicted value returned by the regression function, y is the corresponding true value, Φ(x) is the nonlinear mapping function, ε is the parameter of the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange factor, K(x_i, x_j) = Φ(x_i)Φ(x_j) is the RBF kernel function, and i = 1, 2, ..., l.
The beneficial effect is that the grid search method optimizes the built-in radial basis function parameters of the SVM algorithm. With C = 900 and γ = 0.001 applied to the RBF kernel function, the prediction accuracy of the SVM algorithm is better than with the other parameter combinations, and the performance of the parallel SVM algorithm is globally optimal. The resulting GS-SVM algorithm is then combined with the passenger hotspots of the historical and current time intervals to make predictions, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy caused by existing passenger hotspot prediction algorithms that consider only linear information and ignore nonlinear information, and by nonlinear algorithms that are sensitive to their parameters and easily fall into a local optimum.
In order to solve the technical problem, an embodiment of the present invention further provides a system for predicting passenger-carrying hotspots in parallel, including a data acquisition module, a data preprocessing module, an algorithm establishing module, and a prediction module;
the data acquisition module is used for acquiring the moving track data of the vehicle;
the data preprocessing module is used for preprocessing the movement track data to obtain hot spot data of passengers;
the algorithm establishing module is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
and the prediction module is used for executing the GS-SVM algorithm based on the RDD and outputting the prediction result.
In order to solve the above technical problem, an embodiment of the present invention further provides a terminal, where the terminal includes a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the passenger-carrying hot spot parallel prediction method as described above.
To solve the above technical problem, embodiments of the present invention further provide a computer storage medium storing one or more computer programs, which are executable by one or more processors to implement the steps of the passenger hotspot prediction method described above.
Drawings
Fig. 1 is a flowchart of a passenger-carrying hotspot parallel prediction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining passenger hotspot data provided by an embodiment of the present invention;
Fig. 3 is a framework diagram of a passenger hotspot parallel prediction implementation provided by an embodiment of the present invention;
Fig. 4 is a functional diagram of distributed storage and parallel computing implemented with HDFS and Spark, respectively, provided by an embodiment of the present invention;
Fig. 5 is a diagram of the HDFS process communication framework provided by an embodiment of the present invention;
Fig. 6 is a flowchart of a Spark computing task provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a terminal provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a passenger-carrying hotspot parallel prediction system provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Example 1
The noun explanations in the present invention are shown in Table 1:
TABLE 1
Figure BDA0002646810510000061
Figure BDA0002646810510000071
Apache Hadoop architecture:
Apache Hadoop is a reliable and scalable open source distributed computing architecture that provides a stable and reliable interface for applications on a cluster built from a large amount of inexpensive hardware. It makes full use of the computing and storage capacity of the cluster to build a scalable, highly reliable and fault-tolerant big data batch processing architecture, and realizes distributed storage and parallel computing of large-scale data.
HDFS and MapReduce are the core components of the Hadoop architecture and are open source implementations of GFS (Google File System) and Google MapReduce, respectively. Hadoop realizes distributed storage through HDFS and parallel computing through MapReduce: the NameNode and DataNodes carry out the HDFS functions, and the JobTracker and TaskTrackers carry out the MapReduce functions (see fig. 4). In addition, Hadoop also includes Hadoop Common, Hadoop YARN, Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper and others.
Hadoop Distributed File System (HDFS):
HDFS (Hadoop Distributed File System) is a distributed file system that can be deployed on inexpensive hardware, provides high-throughput parallel data access, and offers high-performance, fault-tolerant and reliable storage of large-scale data. Referring to fig. 5, HDFS adopts a Master/Slave mode and consists of a NameNode (manager) node, a plurality of DataNode (worker) nodes and an HDFS Client; communication among the NameNode, DataNode and HDFS Client processes is realized through the RPC mechanism of Hadoop.
Spark parallel programming model:
Spark is a parallel programming model (often called the "Spark parallel processing framework") that can process large-scale datasets and perform parallel computing tasks on Hadoop clusters consisting of hundreds or thousands of servers. Its core idea is the RDD (Resilient Distributed Dataset), which keeps the data being computed in distributed memory. The Cluster Manager is an external service for acquiring resources on the cluster; a Worker Node is a node in the cluster that can run application code; an Executor is a process started for an application on a Worker Node, responsible for running tasks and keeping data in memory or on disk, and each application has its own independent Executors. After a Task has been executed, its result is returned to the Driver.
Referring to fig. 1, a passenger-carrying hotspot parallel prediction method includes the following steps:
S1: acquiring movement track data of a vehicle;
S2: preprocessing the movement track data to obtain passenger hotspot data;
S3: constructing a parallel GS-SVM algorithm according to the passenger hotspot data;
S4: executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
The beneficial effect is that a parallel GS-SVM algorithm is constructed from the acquired movement track data and implemented through RDD on the Spark parallel distributed computing platform to output the prediction result, which effectively improves prediction accuracy, robustness and real-time performance, and solves the technical problems of high memory consumption and time consumption in distributed storage and parallel computing of movement track big data.
In this embodiment, vehicle GPS track data are used. Passenger hotspots are located accurately through road network gridding: with the grid fixed and a sampling time interval determined, the number of passengers boarding at the passenger hotspots within each grid is counted, giving the number of passengers boarding in the same grid in each preset time interval. With the parallel GS-SVM algorithm, the passenger hotspots of any grid in a future time interval can be predicted from the passenger hotspots of the historical and current time intervals in that grid.
Under a Hadoop distributed computing platform, based on a Spark parallel processing framework, as shown in fig. 2, S2 specifically includes:
S201: extracting track data of which the operation state is changed from 0 to 1 according to the movement track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicles and 1 represents passenger carrying;
S202: carrying out grid division on the road network according to the track data with the operation state of 1;
S203: counting the track data with the operation state of 1 in a preset time interval in the grid to obtain the passenger hotspot data.
In the embodiment, the moving track data is subjected to data extraction, data sorting, road network meshing and data statistics in sequence, so that the influence of error data on the precision of the GS-SVM algorithm is reduced.
In this embodiment, in S2, in a Hadoop distributed computing platform, the specific process of S201 based on the Spark parallel processing framework is as follows:
The movement track data are read from an HDFS file and converted into a resilient distributed dataset (RDD) in Spark. The RDD is split, invalid data are filtered out, and the required fields are extracted. The records are sorted by vehicle ID, the track data in which the operation states of the same vehicle ID are consecutively 0, 1 and 1 (that is, empty, passenger-carrying, passenger-carrying) are identified, the last track record whose operation state is 1 is stored, and the stored records are sorted by time. For example, suppose that vehicle ID1 travels from A to B, where segment A-A1 is empty, A1-A2 is passenger-carrying, A2-A3 is passenger-carrying, and A3-B is empty; then the track data corresponding to A2-A3 are stored.
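The extraction rule of S201 can be illustrated with a minimal sketch in plain Python, standing in for the Spark RDD operations. The record layout `(vehicle_id, state, timestamp, longitude, latitude)`, the function name `extract_pickups` and the sample values are assumptions made for illustration; the sketch reads "states consecutively 0, 1 and 1" as a three-record window within one vehicle's time-ordered trace and keeps the last record of each window.

```python
from itertools import groupby
from operator import itemgetter

def extract_pickups(records):
    """Keep the last record of every (0, 1, 1) state window per vehicle, sorted by time.

    Each record is a hypothetical tuple:
    (vehicle_id, state, timestamp, longitude, latitude), with state 0 = empty, 1 = carrying.
    """
    pickups = []
    # Order by vehicle ID first, then by time within each vehicle.
    ordered = sorted(records, key=itemgetter(0, 2))
    for _, trace in groupby(ordered, key=itemgetter(0)):
        trace = list(trace)
        for i in range(2, len(trace)):
            states = (trace[i - 2][1], trace[i - 1][1], trace[i][1])
            if states == (0, 1, 1):
                pickups.append(trace[i])
    # Final ordering by time, as in the described procedure.
    return sorted(pickups, key=itemgetter(2))

# Made-up sample records for two vehicles.
records = [
    ("v1", 0, 100, 116.30, 39.90),
    ("v1", 1, 110, 116.31, 39.91),
    ("v1", 1, 120, 116.32, 39.92),
    ("v1", 1, 130, 116.33, 39.93),
    ("v1", 0, 140, 116.34, 39.94),
    ("v2", 1, 90, 116.40, 39.85),
    ("v2", 1, 95, 116.41, 39.86),
    ("v2", 0, 105, 116.42, 39.87),
    ("v2", 1, 115, 116.43, 39.88),
    ("v2", 1, 125, 116.44, 39.89),
]
pickups = extract_pickups(records)
```

In a Spark job the same logic would typically run per vehicle after grouping by vehicle ID; the window test itself is unchanged.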
In this embodiment, in S2, in the Hadoop distributed computing platform, the specific process of S202 based on the Spark parallel processing framework is as follows:
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
For example, a target latitude range (39.8283918700-39.9909153300) and a target longitude range (116.2611551300-116.4954361600) are selected. The track data corresponding to A2-A3 are filtered so that track data outside the target longitude and latitude ranges are discarded, and the longitude and latitude of the remaining track data are then divided into grids. The preset grid step can be adjusted flexibly according to actual requirements; for example, the preset grid step is 0.01 degrees, and the grid division mainly divides the filtered track data into 10 × 10 grids.
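A minimal sketch of the filtering and grid assignment in S202 follows. It treats the 39.83 to 39.99 range as latitude and the 116.26 to 116.50 range as longitude, which is the only geographically valid reading of the two ranges; the function name `grid_cell`, the `(row, col)` indexing and the default step of 0.01 degrees are assumptions for illustration.

```python
# Bounds taken from the example in the text.
LAT_MIN, LAT_MAX = 39.8283918700, 39.9909153300
LON_MIN, LON_MAX = 116.2611551300, 116.4954361600

def grid_cell(lon, lat, step=0.01):
    """Return the (row, col) grid index of a point, or None if it lies outside the target range."""
    if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
        return None  # filtered out
    row = int((lat - LAT_MIN) / step)  # index along latitude
    col = int((lon - LON_MIN) / step)  # index along longitude
    return row, col

inside = grid_cell(116.2861551300, 39.8433918700)
outside = grid_cell(117.0, 39.9)
```

Records for which `grid_cell` returns `None` are dropped; the rest carry their cell index into the counting step.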
In S2, on the Hadoop distributed computing platform, the specific process of S203 based on the Spark parallel processing framework is as follows:
dividing each day according to a preset time interval, and counting the number of passengers boarding in the same grid within each preset time interval.
The number of passengers boarding can be determined from the change of the vehicle GPS operation state from 0 to 1. For example, the day is divided into 15-minute intervals and the passengers boarding in the same grid are counted for every 15 minutes, such as 8:00-8:15, 8:15-8:30 and so on. Assuming that the track data corresponding to A2-A3 and the track data corresponding to C1-C2 are located in the same grid, the number of passengers boarding according to the A2-A3 and C1-C2 track data is counted for every 15 minutes of the day in that grid, which determines the number of passengers boarding in each grid.
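The counting in S203 amounts to grouping pickup records by grid cell and time slot. The sketch below assumes each pickup is given as a hypothetical pair `(cell, seconds_since_midnight)`; the function name `count_pickups` and the sample data are illustrative only.

```python
from collections import Counter

def count_pickups(pickups, interval_minutes=15):
    """Count pickups per (grid cell, time slot); slot 0 starts at midnight."""
    counts = Counter()
    slot_seconds = interval_minutes * 60
    for cell, seconds in pickups:
        slot = seconds // slot_seconds  # e.g. 8:00-8:15 is slot 32 for 15-minute slots
        counts[(cell, slot)] += 1
    return counts

# Made-up pickups: three in cell (1, 2) and one in cell (0, 0).
sample = [
    ((1, 2), 8 * 3600),            # 8:00:00
    ((1, 2), 8 * 3600 + 899),      # 8:14:59
    ((1, 2), 8 * 3600 + 900),      # 8:15:00
    ((0, 0), 8 * 3600 + 100),      # 8:01:40
]
counts = count_pickups(sample)
```

Each `(cell, slot)` count is one value of the per-grid time series that the prediction step consumes.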
In this embodiment, S3 specifically includes:
because the built-in kernel function parameters of the SVM algorithm are not optimal, searching for the optimal parameter combination of the RBF kernel function by using a grid search method;
and applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm.
First, the number of passengers boarding at passenger hotspots is located accurately through road network gridding, which improves the robustness and accuracy of passenger hotspot prediction. Second, the grid search method optimizes the built-in radial basis function parameters of the SVM algorithm, and the GS-SVM algorithm is combined with real passenger hotspot data to predict the number of passengers boarding at passenger hotspots in a future time interval, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy caused by existing passenger hotspot prediction algorithms that consider only linear information and ignore nonlinear information, and by nonlinear algorithms that are sensitive to their parameters and easily fall into a local optimum.
The SVM algorithm, also called the support vector machine, has clear advantages in handling nonlinear problems: by mapping into a high-dimensional space, it transforms a low-dimensional nonlinear problem into a linear one. When the SVM algorithm is used for regression prediction, the linear regression function is as follows:
f(x)=wΦ(x)+b (1)
where Φ (x) is the nonlinear mapping function, w is the weight vector, and b is the classification threshold.
The linear ε-insensitive loss function is defined as:
Figure BDA0002646810510000111
where f(x) is the predicted value returned by the regression function and y is the corresponding true value.
To determine the values of w and b, slack variables ξ_i and ξ_i* are introduced:
Figure BDA0002646810510000112
where C is the penalty factor, and the larger C is, the greater the penalty on samples with training error; ε specifies the error requirement of the regression function, and a smaller ε means a smaller error of the regression function.
The Lagrange function is introduced to convert the problem into its dual form for solving:
Figure BDA0002646810510000113
where K(x_i, x_j) = Φ(x_i)Φ(x_j) is the RBF kernel function and α_i is the Lagrange factor.
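Because the equations in this part survive only as image placeholders, the textbook ε-insensitive support vector regression formulation that matches the surrounding definitions is written out below for reference. This is the standard form, assumed here rather than transcribed from the figures, which may use different notation; the symbol \hat{\alpha}_i for the second set of dual multipliers is introduced here, and \gamma is the RBF kernel parameter.

```latex
% epsilon-insensitive loss
\[
L_\varepsilon\bigl(f(x), y\bigr) = \max\bigl(0,\ |y - f(x)| - \varepsilon\bigr)
\]

% primal problem with slack variables \xi_i, \xi_i^*
\[
\min_{w,\,b,\,\xi,\,\xi^*}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{l}\bigl(\xi_i + \xi_i^*\bigr)
\quad\text{s.t.}\quad
\begin{cases}
y_i - w\Phi(x_i) - b \le \varepsilon + \xi_i,\\
w\Phi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\\
\xi_i,\ \xi_i^* \ge 0,\qquad i = 1, 2, \dots, l
\end{cases}
\]

% dual problem
\[
\max_{\alpha,\,\hat{\alpha}}\
-\tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \hat{\alpha}_i)(\alpha_j - \hat{\alpha}_j)K(x_i, x_j)
- \varepsilon\sum_{i=1}^{l}(\alpha_i + \hat{\alpha}_i)
+ \sum_{i=1}^{l} y_i(\alpha_i - \hat{\alpha}_i)
\]
\[
\text{s.t.}\quad \sum_{i=1}^{l}(\alpha_i - \hat{\alpha}_i) = 0,\qquad 0 \le \alpha_i,\ \hat{\alpha}_i \le C
\]

% RBF kernel
\[
K(x_i, x_j) = \exp\bigl(-\gamma\,\lVert x_i - x_j\rVert^2\bigr)
\]
```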
In this embodiment, a grid search method is used to find the optimal combination of the penalty factor C and the RBF kernel parameter γ. Specifically, the grid search traverses all grid points to search for the best combination of radial basis function parameters, and cross validation is used to select the overall best parameter combination, which effectively prevents the SVM algorithm from falling into a local optimum. In this embodiment, C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009] are defined.
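The grid search with cross validation can be sketched as follows using only the Python standard library. To keep the example short and self-contained, RBF kernel ridge regression is swapped in for the SVM regressor (with 1/C as the ridge term), so this shows the search procedure, not the patented model. The C grid is read as 100, 300, 500, 700, 900; the function names, the interleaved 3-fold split and the toy data are assumptions for illustration.

```python
import math
from itertools import product

def rbf(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_predict(train_X, train_y, test_X, C, gamma):
    """Kernel ridge stand-in for SVR: solve (K + I/C) alpha = y, then predict."""
    n = len(train_X)
    K = [[rbf(train_X[i], train_X[j], gamma) + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, train_y)
    return [sum(a * rbf(x, t, gamma) for a, x in zip(alpha, train_X)) for t in test_X]

def cv_mse(X, y, C, gamma, folds=3):
    """Mean squared error over interleaved folds (sample i belongs to fold i % folds)."""
    total = 0.0
    for f in range(folds):
        tr = [i for i in range(len(X)) if i % folds != f]
        te = [i for i in range(len(X)) if i % folds == f]
        pred = fit_predict([X[i] for i in tr], [y[i] for i in tr],
                           [X[i] for i in te], C, gamma)
        total += sum((p - y[i]) ** 2 for p, i in zip(pred, te))
    return total / len(X)

def grid_search(X, y, Cs, gammas):
    """Score every (C, gamma) pair and return the pair with the lowest CV error."""
    scores = {(C, g): cv_mse(X, y, C, g) for C, g in product(Cs, gammas)}
    return min(scores, key=scores.get), scores

# Toy data only: a made-up series of boarding counts indexed by time slot.
X = [[float(t)] for t in range(12)]
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 11.0, 10.0, 12.0, 14.0, 13.0]
Cs = [100, 300, 500, 700, 900]
gammas = [0.001, 0.003, 0.005, 0.007, 0.009]
best, scores = grid_search(X, y, Cs, gammas)
```

The pair returned for this toy series says nothing about the values reported in the text; only the traversal of all 25 grid points and the selection by cross-validated error carry over.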
Solving formula (4) gives the optimal solution of α = [α_1, α_2, …, α_l] as α* = [α_1*, α_2*, …, α_l*]; then
Figure BDA0002646810510000121
Figure BDA0002646810510000122
where SV is the set of support vectors and N_nsv is the number of support vectors.
Combining formulas (1) and (5) and introducing the grid search method gives the optimal parameter combination C = 900 and γ = 0.001, and the optimized GS-SVM algorithm is obtained as follows:
Figure BDA0002646810510000123
In this embodiment, with the optimal parameter combination C = 900 and γ = 0.001 applied to the RBF kernel function of the SVM algorithm, the prediction accuracy of the SVM algorithm is better than with the other parameter combinations.
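As a worked substitution, the standard support vector regression prediction function with the reported parameters C = 900 and γ = 0.001 takes the form below. This is the textbook expression with an RBF kernel, assumed here because the patent's own equation is available only as an image; \hat{\alpha}_i is a symbol introduced here for the second set of dual multipliers.

```latex
\[
f(x) = \sum_{x_i \in SV} (\alpha_i - \hat{\alpha}_i)\,
       \exp\bigl(-0.001\,\lVert x_i - x\rVert^2\bigr) + b,
\qquad 0 \le \alpha_i,\ \hat{\alpha}_i \le 900
\]
```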
In this embodiment, the ability of the GS-SVM algorithm to capture nonlinear information is combined with historical data to predict the number of passenger pickups in a preset time interval; for example, from the number of pickups in a grid during the current 15 minutes, the number of passenger hotspot pickups in the same grid during the next 15-minute interval is predicted.
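One way to turn the per-interval pickup counts of a grid into training samples for this kind of next-interval prediction is a sliding window, sketched below. The function name and the `n_lags` parameter are hypothetical; the text only states that the current 15-minute count is used to predict the next one, which corresponds to `n_lags=1`.

```python
def make_supervised_samples(counts, n_lags=1):
    """Build (features, targets) from a time-ordered list of interval counts.

    features[i] holds the n_lags counts immediately preceding targets[i],
    so each sample pairs recent history with the next interval's count.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    features, targets = [], []
    for t in range(n_lags, len(counts)):
        features.append(counts[t - n_lags:t])
        targets.append(counts[t])
    return features, targets
```

For example, the counts 3, 5, 4, 6 with `n_lags=1` give the pairs (3 → 5), (5 → 4) and (4 → 6).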
In the present embodiment, S4 includes:
Step 1: the optimal combination of the RBF parameters C (penalty factor) and γ (kernel parameter) of the SVM algorithm is found by the Spark parallel grid search method, so that the SVM algorithm has good generalization ability.
Step 2: the optimal parameter combination obtained by the Spark parallel grid search method in step 1 is applied to the radial basis function (RBF) of the support vector machine (SVM) algorithm to obtain the Spark-based parallel GS-SVM algorithm.
Step 3: the ability of the Spark-based parallel GS-SVM algorithm to capture nonlinear information is used to predict passenger hotspots in the same grid in a future time interval.
As shown in fig. 6, the vehicle GPS data are converted into a Resilient Distributed Dataset (RDD) in Spark. The grid search ranges, determined through a large number of experiments, are C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009]. Spark parallel grid search traverses all grid points to find the optimal combination of the RBF parameters C (penalty factor) and γ (kernel parameter) of the SVM algorithm, which yields the GS-SVM algorithm. The pickup counts of the historical and current time intervals are input into the GS-SVM algorithm, which predicts the passenger hotspot pickup count of the future time interval in the same grid.
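The parallel traversal of the parameter grid can be illustrated without a cluster. The sketch below deliberately swaps Spark for Python's standard `concurrent.futures` thread pool: each (C, γ) pair is scored independently, which is the same map-then-reduce shape a Spark job over an RDD of parameter pairs would have. The `score` callable (lower is better) and the `max_workers` setting are hypothetical; this is not the patent's Spark implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def parallel_grid_search(score, c_values, gamma_values, max_workers=4):
    """Score every (C, gamma) pair concurrently and return the best one.

    `score(C, gamma)` is a hypothetical evaluation function, for example a
    cross-validated error; the pair with the lowest score wins.
    Returns (best_C, best_gamma, best_score).
    """
    grid = list(product(c_values, gamma_values))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to grid[i].
        scores = list(pool.map(lambda pair: score(*pair), grid))
    best_index = min(range(len(grid)), key=scores.__getitem__)
    best_c, best_gamma = grid[best_index]
    return best_c, best_gamma, scores[best_index]
```

Since the grid points do not depend on each other, the work divides evenly across workers, which is what makes grid search a natural fit for a distributed framework.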
Under the Hadoop distributed computing platform, the algorithm is implemented on the Spark parallel processing framework in combination with RDDs, which solves the technical problems of distributed storage and parallel computation of moving track big data. The optimal parameter combination found by the grid search method is applied to the SVM algorithm, and prediction combines the passenger hotspots of the historical and current time intervals, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy in existing passenger hotspot prediction, which considers only linear information and ignores nonlinear information, and which does not fully account for nonlinear algorithms being sensitive to parameters and prone to falling into local optima.
The implementation principle of this embodiment is as follows (see fig. 3).
First, data acquisition. GPS track data generated by vehicles are collected to obtain moving track big data.
Second, data preprocessing. A passenger hotspot sequence is obtained through data extraction, data sorting, road network gridding and data statistics. Specifically, on the Spark parallel distributed computing platform, the data are first converted into a Resilient Distributed Dataset (RDD) in Spark; the RDD is then sliced, invalid data are filtered out, the required fields (such as vehicle ID, operation state, time, longitude and latitude) are extracted, and the records are sorted by vehicle ID. Next, tracks with the same vehicle ID and three consecutive operation states of 0 (empty), 1 (carrying passengers) and 1 (carrying passengers) are located, each of which represents a passenger pickup event; the vehicle ID, operation state, time, longitude and latitude of the third record, whose operation state is 1, are retained, and the retained records are sorted by time. Finally, a longitude and latitude range is selected and the saved track data are filtered to that range; the longitudes and latitudes of the filtered track data are divided into grids; the day is divided into a time series with 15-minute intervals; and the number of pickups in each grid within each 15-minute interval is counted. This addresses the distributed storage and parallel computation of large-scale moving track data.
Third, data modeling. A parallel GS-SVM algorithm is constructed on the Spark parallel processing framework to improve the robustness and accuracy of passenger hotspot prediction.
The algorithm optimizes the radial basis function (RBF) of the support vector machine (SVM) by the grid search method, seeks the globally optimal parameter combination by cross-validation, and combines the pickup counts of historical time intervals to predict the passenger hotspot pickup counts of future time intervals. Finally, algorithm implementation. On the Spark parallel processing framework, the GS-SVM algorithm is parallelized in combination with RDDs, which improves the real-time performance and scalability of passenger hotspot prediction.
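The pickup-event extraction in the preprocessing stage (same vehicle ID, consecutive operation states 0, 1, 1, keep the third record) can be sketched in plain Python as follows. The tuple layout `(vehicle_id, state, time, lon, lat)` and the function name are assumptions, and the sketch additionally orders each vehicle's records by time before scanning, which the text implies but does not state. It runs on an in-memory list rather than a Spark RDD.

```python
from itertools import groupby
from operator import itemgetter


def extract_pickups(records):
    """Return the pickup records found in raw track records, sorted by time.

    Each record is (vehicle_id, state, time, lon, lat) with state 0 for an
    empty vehicle and 1 for a vehicle carrying passengers. For every run of
    consecutive states 0, 1, 1 of one vehicle, the third record is kept.
    """
    # Assumption: records of one vehicle must be in time order for
    # "consecutive" states to be meaningful.
    ordered = sorted(records, key=lambda r: (r[0], r[2]))
    pickups = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        track = list(group)
        for i in range(2, len(track)):
            states = (track[i - 2][1], track[i - 1][1], track[i][1])
            if states == (0, 1, 1):
                pickups.append(track[i])
    return sorted(pickups, key=itemgetter(2))
```

Requiring two consecutive records with state 1 after a 0 filters out isolated state flips, which is one plausible reason for matching three records rather than a single 0 to 1 change.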
Example 2
The present embodiment provides a terminal, as shown in fig. 7, which includes a processor 701, a memory 702, and a communication bus 703;
the communication bus 703 is used for realizing connection communication between the processor 701 and the memory 702;
the processor 701 is configured to execute one or more computer programs stored in the memory 702 to implement the steps of the passenger-carrying hot spot parallel prediction method in the foregoing embodiments, which are not described herein again.
On the basis of embodiment 1, this embodiment provides a passenger-carrying hotspot parallel prediction system, as shown in fig. 8, including a data acquisition module 801, a data preprocessing module 802, an algorithm establishing module 803, and a prediction module 804;
the data acquisition module 801 is used for acquiring movement track data of a vehicle;
the data preprocessing module 802 is configured to preprocess the movement trajectory data to obtain hot spot data of the passenger;
the algorithm establishing module 803 is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
the prediction module 804 is configured to execute a GS-SVM algorithm based on the RDD and output a prediction result.
The advantage is that a parallel GS-SVM algorithm is constructed from the moving track big data and implemented in combination with RDDs under the Spark framework to output the prediction result, which effectively improves prediction accuracy and solves the technical problems of distributed storage and parallel computation of moving track big data.
The data preprocessing module 802 is specifically configured to:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the moving track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicle and 1 represents passenger carrying;
carrying out mesh division on the road network according to the track data with the operation state of 1;
and counting the track data with the operation state of 1 in a preset time interval in the grid to obtain the hot spot data of the passengers.
The moving track data undergo data extraction, data sorting, road network gridding and data statistics in sequence, which reduces the influence of erroneous data on the SVM algorithm.
Under the Hadoop distributed computing platform, the data preprocessing module obtains the passenger hotspots based on the Spark parallel processing framework as follows:
reading the moving track data from the HDFS file and converting it into a Resilient Distributed Dataset (RDD) in Spark;
slicing the RDD, filtering invalid data, and extracting the required fields, namely vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID; determining the track data that have the same vehicle ID and consecutive operation states of 0, 1 and 1; saving the last track record whose operation state is 1; and sorting by time;
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length;
and dividing the time of day according to a preset time interval, and counting the number of passengers getting on the same grid in the preset time interval.
The moving track data undergo data extraction, data sorting, road network gridding and data statistics in sequence, which reduces the influence of erroneous data on the SVM algorithm.
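The range filtering, grid division and per-interval counting steps listed above can be sketched as follows. The tuple layout `(seconds_since_midnight, lon, lat)`, the half-open range checks, the square grid step in degrees and all function names are assumptions made for illustration; the text only specifies a preset grid step and a preset time interval (15 minutes in embodiment 1).

```python
import math
from collections import Counter


def grid_cell(lon, lat, lon_min, lat_min, step):
    """Map a coordinate to integer (column, row) indices of a square grid."""
    return (
        math.floor((lon - lon_min) / step),
        math.floor((lat - lat_min) / step),
    )


def time_slot(seconds_since_midnight, interval_minutes=15):
    """Index of the interval of the day that contains the given time."""
    return int(seconds_since_midnight // (interval_minutes * 60))


def count_pickups(pickups, lon_range, lat_range, step, interval_minutes=15):
    """Count pickups per (grid cell, time slot) inside the target range.

    `pickups` is an iterable of (seconds_since_midnight, lon, lat).
    Records outside [lon_min, lon_max) x [lat_min, lat_max) are dropped.
    """
    counts = Counter()
    for seconds, lon, lat in pickups:
        in_lon = lon_range[0] <= lon < lon_range[1]
        in_lat = lat_range[0] <= lat < lat_range[1]
        if in_lon and in_lat:
            cell = grid_cell(lon, lat, lon_range[0], lat_range[0], step)
            counts[(cell, time_slot(seconds, interval_minutes))] += 1
    return counts
```

With 15-minute intervals a day has 96 slots (indices 0 to 95), so each grid cell yields a 96-point daily series of pickup counts.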
Further, the specific process of the algorithm establishing module 803 establishing the parallel GS-SVM algorithm according to the training data is as follows:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm.
The ability of the Spark-based parallel GS-SVM algorithm to capture nonlinear information is then used to predict passenger hotspots in the next 15-minute time interval in the same grid.
First, road network gridding accurately locates the number of pickups at each passenger hotspot, improving the robustness and accuracy of passenger hotspot prediction. Second, the built-in radial basis function parameters of the SVM algorithm are optimized by the grid search method, and the number of pickups at a hotspot in a future time interval is predicted from real passenger hotspot data, improving the real-time performance and scalability of the prediction. This solves the technical problem of low prediction accuracy in existing methods, which consider only linear information and ignore nonlinear information, and which do not fully account for nonlinear algorithms being sensitive to parameters and prone to falling into local optima.
The algorithm establishing module 803 applies the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm, which includes:
based on a large number of experiments, C = [100, 300, 500, 700, 900] and γ = [0.001, 0.003, 0.005, 0.007, 0.009] are defined, and the performance of the SVM algorithm reaches the global optimum when the parameter combination is C = 900 and γ = 0.001.
The GS-SVM algorithm is as follows:
Figure BDA0002646810510000161
Figure BDA0002646810510000162
Figure BDA0002646810510000163
where f(x) is the returned predicted value, y is the corresponding true value, Φ(x) is the nonlinear mapping function, ε is the parameter of the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange multiplier, K(x_i, x_j) = Φ(x_i)·Φ(x_j) is the RBF kernel function, and i = 1, 2, …, l.
The embodiments of the present invention further provide a computer storage medium, where the computer storage medium stores one or more computer programs, and the one or more computer programs may be executed by one or more processors to implement the steps of the method for predicting a hot spot of a passenger carrier in parallel in each of the above embodiments, which are not described herein again.
The beneficial effect of this further scheme is as follows: the built-in radial basis function parameters of the SVM algorithm are optimized by the grid search method, and prediction combines historical passenger hotspots with those of the current time interval, which improves the real-time performance and scalability of passenger hotspot prediction. This solves the technical problem of low prediction accuracy caused by existing passenger hotspot prediction algorithms that consider only linear information and ignore the nonlinearity of passenger hotspots, and that do not fully account for nonlinear algorithms being sensitive to parameters and prone to falling into local optima.
The technical solutions provided by the embodiments of the present invention are described in detail above, and the principles and embodiments of the present invention are explained in this patent by applying specific examples, and the descriptions of the embodiments above are only used to help understanding the principles of the embodiments of the present invention; also, while the present invention has been described with respect to particular embodiments and with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing descriptions of the present invention are provided for illustration and not for the purpose of limiting the invention as defined by the appended claims.

Claims (10)

1. A passenger carrying hotspot parallel prediction method is characterized by comprising the following steps:
acquiring moving track data of a vehicle;
preprocessing the movement track data to obtain passenger hot spot data;
constructing a parallel GS-SVM algorithm according to the passenger hot spot data;
and executing the GS-SVM algorithm based on the RDD, and outputting a prediction result.
2. The passenger-carrying hotspot parallel prediction method of claim 1, wherein the preprocessing the movement trajectory data to obtain passenger hotspot data comprises:
extracting track data of which the vehicle operation state is changed from 0 to 1 according to the movement track data, and storing the track data of which the operation state is 1, wherein 0 represents empty vehicles, and 1 represents passenger carrying;
according to the track data with the operation state of 1, carrying out grid division on the road network;
and counting the track data with the operation state of 1 in the preset time interval in the grid to obtain the hot spot data of the passengers.
3. The passenger-carrying hotspot parallel prediction method according to claim 2, wherein the extracting trajectory data of the vehicle operation state from 0 to 1 according to the movement trajectory data and saving trajectory data of which the operation state is 1 comprises:
reading the moving track data from the HDFS file and converting it into a Resilient Distributed Dataset (RDD) in Spark;
slicing the RDD, filtering invalid data, and then extracting required fields comprising vehicle ID, operation state, time, longitude and latitude; sorting by vehicle ID; determining the track data with the same vehicle ID and consecutive operation states of 0, 1 and 1; saving the vehicle ID, operation state, time, longitude and latitude of the last record whose operation state is 1; and sorting by time.
4. The method according to claim 2, wherein the grid division of the road network according to the trajectory data with the operation state of 1 comprises:
filtering the stored track data according to the selected target longitude range and the selected target latitude range;
and performing grid division on the longitude and latitude of the filtered track data according to a preset grid step length.
5. The passenger-carrying hotspot parallel prediction method of claim 4, wherein the counting of trajectory data with an operation state of 1 in a preset time interval in the grid comprises:
and dividing the time of day according to the preset time interval, and counting the number of passengers getting on the same grid in the preset time interval.
6. The passenger hot spot parallel prediction method according to any one of claims 1-5, wherein the constructing a parallel GS-SVM algorithm according to the passenger hot spot data comprises:
searching an optimal parameter combination of a Radial Basis Function (RBF) kernel function by using a grid search method;
and applying the optimal parameter combination to an SVM algorithm to obtain a GS-SVM algorithm.
7. The method according to claim 6, wherein the applying the optimal parameter combination to the SVM algorithm to obtain the GS-SVM algorithm comprises:
the optimal parameters of the RBF kernel function are penalty factor C = 900 and kernel parameter γ = 0.001; the GS-SVM algorithm is as follows:
Figure FDA0002646810500000021
Figure FDA0002646810500000022
Figure FDA0002646810500000023
where f(x) is the returned predicted value, y is the corresponding true value, Φ(x) is the nonlinear mapping function, ε is the parameter of the linear insensitive loss function, SV is the set of support vectors, N_nsv is the number of support vectors, α_i is the Lagrange multiplier, K(x_i, x_j) = Φ(x_i)·Φ(x_j) is the RBF kernel function, and i = 1, 2, …, l.
8. A passenger-carrying hotspot parallel prediction system is characterized by comprising a data acquisition module, a data preprocessing module, an algorithm establishing module and a prediction module;
the data acquisition module is used for acquiring the moving track data of the vehicle;
the data preprocessing module is used for preprocessing the movement track data to obtain hot spot data of passengers;
the algorithm establishing module is used for establishing a parallel GS-SVM algorithm according to the passenger hot spot data;
and the prediction module is used for executing the GS-SVM algorithm based on the RDD and outputting a prediction result.
9. A terminal, characterized in that the terminal comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the passenger-carrying hot spot parallel prediction method according to any one of claims 1 to 7.
10. A computer storage medium storing one or more computer programs executable by one or more processors to implement the steps of the passenger-carrying hotspot parallel prediction method according to any one of claims 1 to 7.
CN202010857329.5A 2020-08-24 2020-08-24 Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium Pending CN112070529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010857329.5A CN112070529A (en) 2020-08-24 2020-08-24 Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010857329.5A CN112070529A (en) 2020-08-24 2020-08-24 Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium

Publications (1)

Publication Number Publication Date
CN112070529A true CN112070529A (en) 2020-12-11

Family

ID=73660292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010857329.5A Pending CN112070529A (en) 2020-08-24 2020-08-24 Passenger carrying hotspot parallel prediction method, system, terminal and computer storage medium

Country Status (1)

Country Link
CN (1) CN112070529A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080133434A1 (en) * 2004-11-12 2008-06-05 Adnan Asar Method and apparatus for predictive modeling & analysis for knowledge discovery
CN104578053A (en) * 2015-01-09 2015-04-29 北京交通大学 Power system transient stability prediction method based on disturbance voltage trajectory cluster features
CN105371857A (en) * 2015-10-14 2016-03-02 山东大学 Device and method for constructing road network topology based on bus GNSS space-time tracking data
CN107316501A (en) * 2017-06-28 2017-11-03 北京航空航天大学 A kind of SVMs Travel Time Estimation Method based on grid search
CN108256924A (en) * 2018-02-26 2018-07-06 上海理工大学 A kind of product marketing forecast device
CN110264706A (en) * 2019-04-07 2019-09-20 武汉理工大学 A kind of unloaded taxi auxiliary system excavated based on big data
CN110347937A (en) * 2019-06-27 2019-10-18 哈尔滨工程大学 A kind of taxi intelligent seeks objective method
WO2020063690A1 (en) * 2018-09-25 2020-04-02 新智数字科技有限公司 Electrical power system prediction method and apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
彭珍瑞;孟建军;祝磊;蒋兆远;: "基于支持向量机的铁路客运量预测", 辽宁工程技术大学学报, no. 02 *
颜七笙;王士同;: "公路旅游客流量预测的支持向量回归模型", 计算机工程与应用, no. 09, pages 233 - 235 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949939A (en) * 2021-03-30 2021-06-11 福州市电子信息集团有限公司 Taxi passenger carrying hotspot prediction method based on random forest model
CN112949939B (en) * 2021-03-30 2022-12-06 福州市电子信息集团有限公司 Taxi passenger carrying hotspot prediction method based on random forest model
CN114061604A (en) * 2021-10-12 2022-02-18 贵州民族大学 Passenger carrying route recommendation method, device and system based on movement track big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination