Summary of the invention
To address the problems in the prior art that bus arrival time predictions have large errors and that their goodness of fit does not reflect actual arrival behavior, the present invention proposes a bus arrival time prediction method based on a GRU neural network. The specific technical solution is as follows:
A bus arrival time prediction method based on a GRU neural network, the method comprising:
S1, exporting historical data from a database to a CSV file to obtain raw data, and analyzing and processing the raw data with the HBase distributed database and Spark in-memory processing to remove the heterogeneity, complexity and coefficient characteristics of the raw data;
S2, processing and analyzing the processed raw data by feature correlation analysis, from both single-attribute and multi-factor perspectives, to obtain standard time-series data;
S3, performing variable selection on the standard time-series data using the Lasso method, and rejecting weakly correlated feature vectors from the standard time-series data;
S4, building a bus arrival prediction model based on the GRU neural network, and inputting the standard time-series data from which the weakly correlated feature vectors have been rejected into the arrival prediction model to predict the bus arrival time.
Further, step S1 comprises:
S11, obtaining the CSV file from HDFS using SparkSQL to form data in the Spark DataFrame structure;
S12, extracting the historical GPS track data of a specified bus using SparkSQL, and matching the historical GPS track data against bus stop distances using the HBase distributed database.
Further, matching the historical GPS track data against bus stop distances using the HBase distributed database comprises:
S121, setting a threshold for judging whether a matched distance falls within the designated arrival location of the bus, and, if the matched distance is less than the threshold, marking the bus arrival location corresponding to the match;
S122, taking, in chronological order, any two matched GPS positioning points whose time interval is greater than t seconds, and judging whether the bus is running in the up or down direction from the slope of the line connecting the two points;
S123, selecting the positioning time nearest to the stop among the matches, and recording the arrival time based on the running speed and acceleration of the bus;
S124, sorting the raw data by arrival time and by the vehicle number of the bus, and writing the output to HDFS using Spark.
Further, the distance between the bus arrival location and the actual positioning location is calculated by the Great-Circle distance formula, in which $R$ is the Earth's radius, $A_j$ and $A_w$ are respectively the longitude and latitude of the actual positioning location, and $B_j$ and $B_w$ are respectively the longitude and latitude of the bus arrival location.
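As a minimal sketch of the distance calculation described above, the following Python function computes a great-circle distance with the spherical law of cosines. The specific form of the formula, the mean Earth radius of 6371 km and the function name are assumptions made for illustration; the parameter names follow the symbols $A_j$, $A_w$, $B_j$, $B_w$ and $R$ defined above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius R


def great_circle_km(a_j, a_w, b_j, b_w, r=EARTH_RADIUS_KM):
    """Great-circle distance between (a_j, a_w) and (b_j, b_w).

    a_j/b_j are longitudes and a_w/b_w are latitudes, all in degrees.
    Uses the spherical law of cosines (an assumed form of the formula).
    """
    lat_a, lat_b = math.radians(a_w), math.radians(b_w)
    d_lon = math.radians(b_j - a_j)
    cos_angle = (math.sin(lat_a) * math.sin(lat_b)
                 + math.cos(lat_a) * math.cos(lat_b) * math.cos(d_lon))
    # Clamp against floating-point drift just outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return r * math.acos(cos_angle)
```

A GPS fix would then count as a candidate arrival when this distance to the stop falls below the chosen threshold.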
Further, the slope $K$ is calculated from the following quantities: $D_{lon}$ and $D_{lat}$ are the longitude and latitude of the terminal station of the up route, $S_{lon}$ and $S_{lat}$ are the longitude and latitude of the starting station of the up route, $A_{lon}$ and $A_{lat}$ are the longitude and latitude of the later station on the vehicle's driving track, and $B_{lon}$ and $B_{lat}$ are the longitude and latitude of the earlier station. If $K > 0$, the bus is travelling in the same direction as the up route, i.e. up; otherwise it is travelling down.
Further, step S123 calculates the arrival time by a formula in which $s$ is the distance from the most recent positioning point to the stop, $v_0$ is the running speed of the bus at the bus arrival location, $v_t$ is the speed on arrival, which defaults to 0, and $t$ is the time taken from the most recent positioning point to the bus stop.
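The timing step can be illustrated with a short sketch. It assumes uniform acceleration between the most recent positioning point and the stop, so that $s = (v_0 + v_t)\,t / 2$; this kinematic relation and the function name are assumptions, not a formula reproduced from the source.

```python
def time_to_stop(s, v0, vt=0.0):
    """Travel time t from the last positioning point to the stop.

    Assumes uniform acceleration, so s = (v0 + vt) / 2 * t.
    s is in metres, v0 and vt in metres per second; vt defaults to 0.
    """
    if s < 0:
        raise ValueError("distance s must be non-negative")
    mean_speed = (v0 + vt) / 2.0
    if mean_speed <= 0:
        raise ValueError("mean speed must be positive to reach the stop")
    return s / mean_speed


# Example: 50 m from the stop at 10 m/s, coming to rest at the station.
eta_seconds = time_to_stop(50.0, 10.0)
```

Under this assumption the recorded arrival time would be the timestamp of the most recent positioning point plus the returned value.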
Further, in the defining formula of the Lasso method, $x_{ij}$ is the $j$-th variable of the $i$-th group, the vector $\beta$ holds the regression coefficients, and $y$ denotes the training label.
In the bus arrival time prediction method based on a GRU neural network of the present invention, the raw data are first processed with Spark to obtain standard time-series data, thereby extracting the bus arrival data; the Lasso method is then used to pick out and reject weakly correlated feature vectors, thereby performing variable selection; finally, a bus arrival prediction model is built with the GRU neural network to predict the specific bus arrival time. Compared with the prior art, the GRU neural network of the present invention includes a process of screening and selecting data, and by screening and selecting the bus arrival data the method can effectively improve the accuracy of bus arrival time prediction.
Specific embodiment
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
Referring to Fig. 1, an embodiment of the present invention provides a bus arrival time prediction method based on a GRU neural network, comprising the following steps:
Step 1: historical data are exported from a database to a CSV file to obtain raw data, and the raw data are analyzed and processed with the HBase distributed database and Spark in-memory processing to remove the heterogeneity, complexity and coefficient characteristics of the raw data. Referring to Fig. 2, the database stores historical records of real-time bus operation, and these historical records (i.e. the historical data) are all recorded by GPS. Because the raw data are recorded by GPS devices mounted on the buses, reception accuracy and delay are an issue, and during actual operation the directly received data may be affected by GPS positioning accuracy and by the network, so that records may not conform to the standard format, may be obviously wrong, or may be duplicated. For this reason, the method first obtains the original CSV file from HDFS using SparkSQL to form data in Spark DataFrame format, extracts redundant and abnormal data by matching on time series and vehicle number, deletes redundant columns, then sorts the data by time and by vehicle order, and stores the cleaned data in the database through the HBase Phoenix interface. Next, the historical GPS track data of a specified bus in HBase are matched against bus stop distances using Spark resilient distributed datasets. The matching process comprises the following steps.
First, a threshold is set for judging whether a matched distance falls within the set arrival location of the specified bus; if the matched distance is less than the threshold, the bus arrival location corresponding to the match is marked. The distance between the bus arrival location and the actual positioning location is calculated by the Great-Circle distance formula, in which $R$ is the Earth's radius, $A_j$ and $A_w$ are respectively the longitude and latitude of the actual positioning location, and $B_j$ and $B_w$ are respectively the longitude and latitude of the bus arrival location.
Next, any two matched GPS track points whose time interval is greater than t seconds are taken in chronological order, and the up or down running direction of the bus is judged from the slope of the line connecting the two points. In the slope formula, $D_{lon}$ and $D_{lat}$ are the longitude and latitude of the terminal station of the up route, $S_{lon}$ and $S_{lat}$ are the longitude and latitude of the starting station of the up route, $A_{lon}$ and $A_{lat}$ are the longitude and latitude of the later station on the vehicle's driving track, and $B_{lon}$ and $B_{lat}$ are the longitude and latitude of the earlier station. In the result, if $K > 0$, the bus is travelling in the same direction as the up route, i.e. up; otherwise it is travelling down.
Then, the positioning time nearest to the stop among the matches is selected, and the arrival time is recorded based on the running speed and acceleration of the bus. The arrival time is calculated by a formula in which $s$ is the distance from the most recent positioning point to the stop, $v_0$ is the running speed of the bus at the bus arrival location, $v_t$ is the speed on arrival, which defaults to 0, and $t$ is the time taken from the most recent positioning point to the bus stop.
Finally, the raw data are sorted by arrival time and by the vehicle number of the bus, and the output is stored in HDFS using Spark in-memory processing. In addition, the station names and positions in the station information table are located on a map to find the coordinate positions corresponding to the GPS track data, and the spacing between stations is analyzed along the operating route to form the station information table.
In a specific embodiment, if the positioning points of several historical GPS track records match the distance to a bus stop, the records are screened with "nearest" and "earliest" as the primary conditions, and the best matching result is selected. The format of the raw data used by the present invention is shown in Table one, the bus arrival timetable in Table two, and the bus station information table in Table three.
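The threshold test and the "nearest, earliest" screening can be sketched as follows. The record layout, the 50 m threshold and the function name are hypothetical, and the distance from each GPS fix to the stop is assumed to have been computed already.

```python
from typing import NamedTuple, Optional, Sequence


class GpsFix(NamedTuple):
    timestamp: int           # seconds since some epoch (hypothetical layout)
    distance_to_stop: float  # metres, precomputed for one stop


def best_match(fixes: Sequence[GpsFix], threshold_m: float = 50.0) -> Optional[GpsFix]:
    """Pick the fix that best represents arrival at a stop.

    Keeps only fixes closer than threshold_m, then prefers the nearest
    fix and breaks ties by the earliest timestamp.
    """
    candidates = [f for f in fixes if f.distance_to_stop < threshold_m]
    if not candidates:
        return None
    return min(candidates, key=lambda f: (f.distance_to_stop, f.timestamp))


fixes = [GpsFix(100, 80.0), GpsFix(110, 12.5), GpsFix(120, 12.5), GpsFix(130, 40.0)]
match = best_match(fixes)
```

Sorting on the pair (distance, timestamp) is one simple way to express "nearest first, then earliest"; other orderings of the two conditions are equally possible.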
Table one
Table two
Table three
Step 2: the processed raw data are processed and analyzed by feature correlation analysis, from both single-attribute and multi-factor perspectives, to obtain standard time-series data. In a specific embodiment, considered from the single-attribute perspective, the running time between the stops of each trip necessarily affects the arrival time at the next stop, and in actual driving different vehicles also show certain regular variations because their drivers differ. Considered from the perspective of the relationships among multiple factors, operating features such as the stop spacing, the departure time and whether the day is a working day necessarily affect the operating efficiency of the whole route and hence change the arrival time. Combining these with the time-series relationship already present in the data, the data are processed into standard time-series form by feature correlation analysis. In the present invention, the method analyzes, along the two dimensions of time and space, the influence of different time periods and different weather conditions on the bus arrival time; see Table four for details.
Table four
Step 3: variable selection is performed on the standard time-series data using the Lasso method, and weakly correlated feature vectors are rejected from the standard time-series data. In practice, predicting bus arrival is a regression problem. If the regression analysis involves too many predictor variables, computing a subset selection becomes impractical, and subset selection is inherently discontinuous, which makes the selected subset highly variable. The present invention therefore uses the Lasso method to perform variable selection and reject weakly correlated feature vectors. In the defining formula of the Lasso method, $x_{ij}$ is the $j$-th variable of the $i$-th group, the vector $\beta$ holds the regression coefficients, and $y$ denotes the training label. Referring to Fig. 3, the detailed process of rejecting weakly correlated feature vectors with the Lasso method is as follows:
First, variable selection is performed with the Lasso method in a specified programming language, and the coefficient values of the different attributes are obtained; the variable selection coefficients of the Lasso method are shown in Table five. This embodiment implements the Lasso method in program code.
Table five
Then, according to the correlation coefficients, specified attributes are selected as the inputs and output of the prediction model. Preferably, this embodiment selects the attributes BUSNO, STOP, WEEKDAY, DISTANCE, STARTTIME and WEATHER as model inputs and the arrival time (STOPTIME) as the model output. This is only a preferred embodiment of the method; in other embodiments the attributes can be selected according to the actual situation, and the present invention is not limited in this respect.
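The variable selection step can be illustrated with a small, dependency-free sketch. It is not the program code of the embodiment: it assumes the common Lasso objective $\frac{1}{2n}\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$, solves it by coordinate descent, and uses hypothetical toy data and function names. Attributes whose coefficient is driven to zero are the ones that would be rejected.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator used by the Lasso update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * sum_i (y_i - sum_j x_ij * b_j)^2 + lam * sum_j |b_j|.

    X is a list of rows, y a list of labels. No intercept is fitted,
    so the columns of X and y are assumed to be centred beforehand.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that ignores feature j.
            rho = 0.0
            norm = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm else 0.0
    return beta


# Toy data: y depends only on the first column; the second is irrelevant.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [2.0, 1.0]]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]
beta = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta) if abs(b) > 1e-8]
```

In this toy case only the first column survives, which mirrors how attributes with zero coefficients would be dropped before the remaining ones are fed to the prediction model.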
In practice, after the regression analysis preprocessing, the dimensions (units) of the data may become inconsistent when more data are added in the future, so the data need to be standardized, converting dimensional expressions into dimensionless ones. For this purpose, the present invention encodes categorical data with class labels, for example representing 10 vehicle numbers as 0 to 9, and standardizes them with zero-mean standardization, in whose formula $x$ denotes the original fixed-type data, $x^*$ the new data, $\mu$ the sample mean and $\sigma$ the sample standard deviation. The other data are standardized with deviation (min-max) standardization, in whose formula $y$ denotes the standardized value and $x$ the original feature value. Normalization thus turns the data into dimensionless scalars, which effectively reduces the time needed to find the optimal solution, improves the convergence speed and prediction accuracy of the model, and lets every feature contribute equally to the result. It resolves the problem of differing dimensions in new data and improves the prediction efficiency and accuracy of the method.
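The two standardization schemes named above can be sketched in a few lines. The sketch assumes the usual definitions, $x^* = (x - \mu)/\sigma$ for zero-mean standardization and $y = (x - \min)/(\max - \min)$ for deviation standardization; the use of the population standard deviation, the sample values and the function names are assumptions for illustration.

```python
import statistics


def z_score(values):
    """Zero-mean standardisation: x* = (x - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """Min-max (deviation) standardisation: y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


bus_labels = list(range(10))        # 10 vehicle numbers encoded as 0..9
distances = [300.0, 450.0, 600.0]   # hypothetical stop spacings in metres
scaled_labels = z_score(bus_labels)
scaled_distances = min_max(distances)
```

After scaling, the labels have mean 0 and unit spread while the distances lie in the range 0 to 1, so neither attribute dominates purely because of its units.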
Step 4: a bus arrival prediction model is built based on the GRU neural network, and the standard time-series data from which the weakly correlated feature vectors have been rejected are input into the arrival prediction model to predict the bus arrival time. Referring to Fig. 4, the GRU neural network has two gates, a reset gate and an update gate, and it does not maintain a separate internal memory $C_t$. The GRU neural network works as follows. First, at time step $t$ the update gate is calculated by $z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$, where $x_t$ is the $t$-th component of the input sequence $x$, which is linearly transformed by multiplication with the weight matrix $W^{(z)}$, and $h_{t-1}$ holds the information of the previous time step, which is linearly transformed by the weight matrix $U^{(z)}$. The update gate adds these two parts and passes the sum through the sigmoid activation function, which compresses the result to between 0 and 1. The update gate determines how much past information is passed on to the future, which reduces the risk of vanishing gradients. The reset gate determines how much past information is forgotten and is given by $r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$; as with the update gate, the input component and the information saved from the previous step are linearly transformed and then passed through the sigmoid activation function.
Next, the new memory content uses the reset gate to store the relevant information from the past, calculated by $h'_t = \tanh(W x_t + r_t \odot U h_{t-1})$, where the input $x_t$ and the previous step's information $h_{t-1}$ are first linearly transformed, i.e. multiplied by the matrices $W$ and $U$ respectively. Because the reset gate is a vector of values between 0 and 1, each value measures how far the gate is open; a gate value of 0 for an element means that the element is completely forgotten in this step. The Hadamard product of the reset gate $r_t$ and $U h_{t-1}$ therefore determines how much information is retained or forgotten. Finally, the two parts are added and passed through the hyperbolic tangent activation function tanh.
Finally, the final memory $h_t$ of the GRU neural network at the current time step is calculated by $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot h'_t$. $h_t$ retains the information of the current unit and passes it to the next unit. The activation result $z_t$ of the update gate determines which information is collected from the current memory content $h'_t$ and from the previous step's information $h_{t-1}$: the Hadamard product of $z_t$ and $h_{t-1}$ is the information from the previous time step that is kept in the final memory, and adding the information from the current memory that is kept in the final memory gives the output of the gated recurrent unit.
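The update gate, reset gate, candidate memory and final memory described in this step can be traced in a small plain-Python sketch of a single GRU time step. The helper names, the absence of bias terms and the all-zero example weights are assumptions chosen to keep the example short; a real model would use a deep learning framework and trained weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step following the gate equations (no bias terms)."""
    # Update gate: z_t = sigmoid(Wz x_t + Uz h_prev)
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x_t), matvec(Uz, h_prev))]
    # Reset gate: r_t = sigmoid(Wr x_t + Ur h_prev)
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x_t), matvec(Ur, h_prev))]
    # Candidate memory: h'_t = tanh(W x_t + r_t * (U h_prev))
    h_cand = [math.tanh(a + ri * b)
              for a, ri, b in zip(matvec(W, x_t), r, matvec(U, h_prev))]
    # Final memory: h_t = z_t * h_prev + (1 - z_t) * h'_t
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# Tiny example: 2 inputs, 1 hidden unit, all-zero weights (hypothetical values).
zeros_in = [[0.0, 0.0]]
zeros_h = [[0.0]]
h_t = gru_step([1.0, 2.0], [0.8], zeros_in, zeros_h, zeros_in, zeros_h, zeros_in, zeros_h)
```

With zero weights both gates evaluate to 0.5 and the candidate memory to 0, so the new state is simply half of the previous state, which makes the blending role of $z_t$ easy to see.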
In a specific embodiment, each GRU layer has built-in update gates that protect and control its state, realizing parameter sharing and recurrent memory. Specifically, a function implementing an exponentially decaying learning rate is added, and the Adam gradient descent method is used. Adam computes exponential moving averages of the gradient, where $m_t$ denotes the first-order momentum, $v_t$ the second-order momentum, and $\beta_1$, $\beta_2$ the decay coefficients of the moving averages. In the initial iterations both momenta are biased toward their initial values $m_0 = 0$ and $v_0 = 0$; they are therefore bias-corrected before being used to update the parameters along the gradient. Compared with a prediction model built on LSTM, the arrival prediction model of the present method based on the GRU neural network has a simpler overall structure. When successive gradient directions agree, learning is accelerated; when they disagree, oscillation is suppressed. The cost module calculates the loss between the predicted and true values, and the resulting loss determines the next optimization step of the GRU neural network and the direction of gradient optimization. The save module stores the model parameters and safeguards the model: after a model has been trained it can be saved in full, which on the one hand preserves the data continuously and on the other hand allows the saved model to be used in the next prediction to optimize the steps of the whole prediction process, improving the prediction efficiency of the method.
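The Adam update described above can be sketched for a single scalar parameter. The sketch follows the widely used formulation of Adam (moving averages of the gradient and squared gradient, bias correction, then a scaled step); the default values of `beta1`, `beta2` and `eps`, the function name and the toy objective are assumptions, not values taken from the embodiment.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    t is the 1-based iteration count. The default beta1, beta2 and eps
    are the commonly used values, assumed here for illustration.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-order momentum
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-order momentum
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for step in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, step, lr=0.01)
```

Because of the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is the effect the correction of the initial zero momenta is meant to achieve.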
Referring to Fig. 5, in an embodiment of the present invention, the prediction model based on the GRU neural network is built as follows:
The hyperparameters are chosen first. Preferably, this embodiment initializes the weights from a normal distribution with standard deviation 0.1, initializes the biases to 0.1, and uses an initial learning rate of 0.001, a decay coefficient of 0.9, a decay speed of 1000, a training batch size (Batch_size) of 800, 30 passes over all samples (Epoch) and 30 time steps (Timesteps).
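One way to read the learning-rate hyperparameters is as an exponential decay schedule. The sketch below assumes that the "decay speed" of 1000 is the number of training steps over which the rate is multiplied once by the decay coefficient 0.9; this interpretation and the function name are assumptions.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.9, decay_steps=1000):
    """Exponentially decayed learning rate after `step` training steps."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Learning rate at the start and after 1000 and 5000 steps.
schedule = [decayed_learning_rate(s) for s in (0, 1000, 5000)]
```

Under this reading the rate falls from 0.001 to 0.0009 after the first 1000 steps and keeps shrinking by the same factor every further 1000 steps.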
Model training is then carried out. Specifically, this embodiment analyzes 14 days of historical arrival data for the No. 41 bus in the Nantong area. The first 10 days of data are taken as the training set of the arrival prediction model, a quadratic loss function is used as the error function to be minimized in training, and the last 4 days of the 14 are used as test data for verifying the training result. In the loss formula, $C$ is the value of the quadratic loss function, $x$ is the input, $y(x)$ is the true arrival time, $a$ is the output obtained for input $x$, i.e. the predicted value, and $n$ is the total amount of data in one training pass. In practical applications, to prevent over-fitting and further reduce the error so that the model learns well, an L2 regularization term is added to the loss function, where $\omega$ denotes the weights and $\lambda$ balances the relative importance of the quadratic loss and the weight term.
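The regularized loss can be sketched as follows. The exact scaling is an assumption: the sketch uses $C = \frac{1}{2n}\sum_x (y(x) - a)^2 + \frac{\lambda}{2n}\sum \omega^2$, one common textbook form, and the sample values and function name are hypothetical.

```python
def quadratic_loss(targets, outputs, weights=(), lam=0.0):
    """Quadratic loss with an optional L2 penalty.

    Assumed form: C = (1 / 2n) * sum (y - a)^2 + (lam / 2n) * sum w^2,
    where n is the number of samples.
    """
    n = len(targets)
    if n == 0 or n != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equal in length")
    data_term = sum((y - a) ** 2 for y, a in zip(targets, outputs)) / (2 * n)
    penalty = lam * sum(w * w for w in weights) / (2 * n)
    return data_term + penalty


# Hypothetical arrival times in seconds: true values vs. predictions.
loss_plain = quadratic_loss([100.0, 200.0], [110.0, 190.0])
loss_reg = quadratic_loss([100.0, 200.0], [110.0, 190.0], weights=[1.0, -2.0], lam=0.4)
```

A larger `lam` shifts the balance toward small weights, while `lam = 0` recovers the plain quadratic loss.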
The bus arrival prediction model built on the GRU neural network according to the present invention is compared in terms of loss with a conventional prediction model built on LSTM. Referring to Fig. 6, the loss of the present method falls rapidly over the first four iterations and stabilizes after the fifth, indicating that the arrival prediction model built by the present method is by then fully trained. The present invention can thus achieve the model's prediction function with few training iterations, which effectively improves the prediction speed of the whole model and the overall prediction efficiency.
Referring to Fig. 7, the bus arrival times predicted by the method of the present invention are compared with the actual bus arrival times. Instead of the mean absolute percentage error (MAPE), the present invention uses the linear-regression goodness-of-fit index R-squared, in whose formula $y$ denotes the actual arrival time, $y^*$ the arrival time predicted by the arrival prediction model built on the GRU neural network, and $\bar{y}$ the mean value. The R-squared index is calculated for all trips over 3 days fitted by the arrival prediction model built on the GRU neural network and then averaged, giving a goodness of fit of 94.547% for the model. Comparing the actual arrival times with the times predicted by the model shows that the predictions of the present method are close to the actual bus arrival times, with small error.
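The per-trip goodness of fit and its average can be sketched as follows. The sketch assumes the common definition $R^2 = 1 - \sum (y - y^*)^2 / \sum (y - \bar{y})^2$; the sample data and function names are hypothetical and do not reproduce the reported 94.547%.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed form)."""
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


def mean_r_squared(trips):
    """Average R-squared over (actual, predicted) pairs, one pair per trip."""
    scores = [r_squared(a, p) for a, p in trips]
    return sum(scores) / len(scores)


# Hypothetical arrival times in seconds for two trips.
trips = [
    ([100.0, 200.0, 300.0], [110.0, 200.0, 290.0]),
    ([50.0, 150.0, 250.0], [50.0, 150.0, 250.0]),
]
fit = mean_r_squared(trips)
```

Averaging per-trip scores, as described in the text, weights every trip equally regardless of how many stops it contains.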
The present method is further compared, in goodness of fit and performance, with the prediction model built on LSTM. Referring to Table six, the present method shows a clear improvement in goodness of fit over the conventional LSTM-based prediction model, the GM11 algorithm and the SVM algorithm, indicating that its prediction accuracy is higher than that of conventional bus arrival prediction. Referring to Fig. 8 and Fig. 9 together with Table seven, the present method and the conventional LSTM prediction model were each trained ten times. Whatever values of epoch and batchsize are chosen, the present method takes less time than LSTM: with 100 epochs and a batchsize of 300, the average time taken by the LSTM network is 7.168% more than that of the GRU network, and with 300 epochs and a batchsize of 3000 it is 14.1% more. It follows that, as the amount of data keeps growing, the arrival prediction model built on the GRU neural network saves more computing resources, reduces the time spent on model training, and improves the operating efficiency of the model.
Table six
Table seven
In summary, in the bus arrival time prediction method based on a GRU neural network of the present invention, the raw data are first processed with Spark to obtain standard time-series data, thereby extracting the bus arrival data; the Lasso method is then used to pick out and reject weakly correlated feature vectors, thereby performing variable selection; finally, a bus arrival prediction model is built with the GRU neural network to predict the specific bus arrival time. Compared with the prior art, the GRU neural network of the present invention includes a process of screening and selecting data, and by screening and selecting the bus arrival data the method can effectively improve the accuracy of bus arrival time prediction.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features. Any equivalent structure made using the description and drawings of the present invention, and used directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.