CN115938104B

CN115938104B - Dynamic short-time road network traffic state prediction model and prediction method

Info

Publication number: CN115938104B
Application number: CN202111115375.9A
Authority: CN
Inventors: 任毅龙; 姜涵; 于海洋; 晁文杰
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2024-06-28
Anticipated expiration: 2041-09-23
Also published as: CN115938104A

Abstract

The invention discloses a dynamic short-time road network traffic state prediction method optimized with the deep deterministic policy gradient (DDPG) algorithm. The method comprises collecting and organizing the data uploaded by vehicle-mounted devices to an upper-level system, constructing a KNN-based static prediction model, constructing and training a dynamic optimization part based on the DDPG algorithm, and dynamically optimizing parameters through this deep reinforcement learning algorithm. The KNN prediction model uses a vector-form expression of the short-time traffic state, which gives the model greater flexibility in handling abrupt traffic state changes as well as regular and irregular traffic evolution scenarios. Through the dynamic optimization of the DDPG algorithm, the short-time traffic state prediction model changes from a simple static prediction that merely rolls forward in time into one that learns the evolution of the traffic state and adjusts its predictions dynamically. This addresses the problem that static and semi-static models in the prior art can only fit historical data and patterns and cannot adapt quickly to sudden, random changes in the real-time traffic state, and thereby further improves prediction accuracy.

Description

Dynamic short-time road network traffic state prediction model and prediction method

Technical Field

The invention belongs to the field of traffic big data technology and applications, relates to short-time road network traffic state prediction, and in particular relates to a dynamic short-time road network traffic state prediction model optimized with the deep deterministic policy gradient (DDPG) algorithm.

Background

Short-time road network traffic state prediction plays an important practical role both in intelligent transportation systems and in future-oriented intelligent vehicle-infrastructure cooperative systems, and is a prerequisite for other real-time traffic services such as en-route route guidance and route decision-making. The quality of many basic traffic services therefore depends on the accuracy of short-time road network traffic state prediction.

In recent years, with the diversification of traffic detectors and improvements in data storage devices, traffic data acquisition technology and related application research have advanced considerably. Correspondingly, short-time road network traffic state prediction algorithms driven by traffic big data have proliferated. They mainly comprise shallow learning (traditional machine learning) models, represented by K-nearest neighbors (KNN), support vector machines, and decision trees, and deep learning models, represented by long short-term memory (LSTM) networks, convolutional neural networks, and combinations of the two.

However, both types of models are usually trained and built on extensive historical data, which fixes the main architecture and hyperparameters of the model; during application they are either no longer tuned or are tuned only at certain intervals. Such static and semi-static models can fit historical data and patterns well, but cannot adapt quickly to sudden, random changes in the real-time traffic state or make corresponding adjustments in time.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a dynamic short-time road network traffic state prediction model and a prediction method.

The complete technical scheme of the invention is as follows:

A dynamic short-time road network traffic state prediction method optimized with the deep deterministic policy gradient (DDPG) algorithm comprises the following steps:

Step one: data collection and processing

(1) Collect the time, position, and instantaneous vehicle speed information uploaded to the upper-level system by vehicle-mounted GPS devices at a specified time interval;

(2) Calculate the average vehicle speed v(t) on a given road section l at time t, where the averages at time t and at all earlier times are known values;

(3) Obtain the corresponding set of average speeds V_t = (v_1(t), v_2(t), …, v_n(t)) on the road network formed by road sections l_1, l_2, …, l_n, where n is the number of road sections;

(4) Aggregate the observations at time t and at the δ−1 times preceding it into a space-time matrix X_t that represents the traffic state at time t, as shown in formula (1):

X_t = [V_{t−δ+1}, …, V_{t−1}, V_t]    (1)

(5) Process X_t by calculating the trend vector of each single reference state point, and define the traffic state unit X′_t in vector form, i.e.

where

Reference state points here are the specific values at a given time in the space-time matrix X_t; these values are known.
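Sub-steps (1) to (4) of step one can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout `(time_index, segment_id, speed)`, the function names `mean_speeds` and `space_time_matrix`, the choice of one row per time step, and the fallback value 0.0 for a road section with no observations are all hypothetical.

```python
from collections import defaultdict

def mean_speeds(records, n_segments):
    """Return {t: V_t}, where V_t[l] is the mean speed on road section l at time t.

    records: iterable of (time_index, segment_id, speed) tuples (assumed layout).
    """
    sums = defaultdict(lambda: [0.0] * n_segments)
    counts = defaultdict(lambda: [0] * n_segments)
    for t, seg, speed in records:
        sums[t][seg] += speed
        counts[t][seg] += 1
    return {
        t: [s / c if c else 0.0 for s, c in zip(sums[t], counts[t])]
        for t in sums
    }

def space_time_matrix(V, t, delta):
    """Stack V_{t-delta+1}, ..., V_t into X_t (one row per time step)."""
    return [V[k] for k in range(t - delta + 1, t + 1)]

# Toy GPS uploads for a two-section network over three time steps.
records = [
    (0, 0, 30.0), (0, 0, 40.0), (0, 1, 20.0),
    (1, 0, 50.0), (1, 1, 10.0),
    (2, 0, 45.0), (2, 1, 15.0), (2, 1, 25.0),
]
V = mean_speeds(records, n_segments=2)
X_2 = space_time_matrix(V, t=2, delta=2)
print(X_2)  # [[50.0, 10.0], [45.0, 20.0]]
```

The trend vector of sub-step (5) is deliberately left out, because its defining formula is not available in this text.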

Step two: construction of a KNN-based static predictive model

(1) The distance ED_i between reference state points V_t is measured with the Euclidean distance, and the distance CD_i between trend vectors is measured with the cosine distance; the expressions are:

where V_{h,i} is the i-th known reference state point and the accompanying trend vector is the i-th known trend vector; symbols with the subscript h denote known historical data in the collected samples. From these, a state distance SD_i is constructed to measure the similarity of state units:

where u = 1, 2, …, M, M is the number of historical samples, and α is the coefficient that balances the Euclidean distance and the cosine distance, with value range [0, 1];

(2) Selecting K neighbors according to similarity measurement results

Calculating state distances between a sample X _t+1 to be predicted and all known historical samples, and taking K historical samples V _h,1,V_h,2,…,V_h,K with the smallest distance as neighbors;

(3) Calculating a predicted value of a sample to be predicted

The predicted value is calculated by incremental prediction: the label values of the neighbors are Gaussian-weighted according to the magnitude of the state distance SD_i.

For X_t, the future state point at time (t+1) is denoted y_t.

The increment is the difference between y_{h,j} and V_{h,j} of a neighbor; for the j-th nearest neighbor (j = 1, 2, …, K) it is:

Δy_{h,j} = y_{h,j} − V_{h,j}    (6)

i.e., the change in traffic state within the prediction window;

Secondly, the predicted value is calculated through Gaussian weighting as

where the weights are

Step two, constructing the KNN-based static prediction model, further comprises:

(4) Coarse calibration of model parameters

The undetermined parameters δ, K, and α of the KNN-based static prediction model are calibrated by a grid search experiment on the collected real data, specifically as follows:

First, an evaluation system for the prediction effect is established, comprising the mean absolute error (MAE) and the mean absolute percentage error (MAPE):

where n is the number of road sections, N is the number of samples to be predicted in the experiment, and the remaining symbols denote the predicted value and the true value of the i-th sample to be predicted, respectively;

Second, the value ranges of the parameters to be calibrated are discretized, the different parameter combinations are tried one by one in experiments, and the results are recorded;

Finally, the parameter combination with the best experimental result is selected as the calibrated values of the model parameters.
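The coarse calibration can be sketched as a plain grid search. The MAE and MAPE formulas are not available in this text, so the code below assumes their usual definitions, averaged over all samples and road sections. The K and α grids follow the embodiment (K from 5 to 120 in steps of 5, α from 0 to 1 in steps of 0.1); the δ candidates, the `toy_evaluate` stand-in for a real prediction experiment, and the rule "lowest MAE, then lowest MAPE" for choosing the best combination are illustrative assumptions.

```python
from itertools import product

def mae(pred, true):
    """Mean absolute error over all samples and road sections."""
    errs = [abs(p - t) for ps, ts in zip(pred, true) for p, t in zip(ps, ts)]
    return sum(errs) / len(errs)

def mape(pred, true):
    """Mean absolute percentage error in percent (true values must be nonzero)."""
    errs = [abs(p - t) / abs(t) for ps, ts in zip(pred, true) for p, t in zip(ps, ts)]
    return 100.0 * sum(errs) / len(errs)

def grid_search(evaluate, deltas, ks, alphas):
    """Try every (delta, K, alpha) combination and keep the lowest (MAE, MAPE)."""
    best_score, best_params = None, None
    for params in product(deltas, ks, alphas):
        score = evaluate(*params)  # expected to return a (MAE, MAPE) tuple
        if best_score is None or score < best_score:
            best_score, best_params = score, params
    return best_params, best_score

pred = [[42.0, 27.0], [50.0, 20.0]]
true = [[40.0, 30.0], [50.0, 25.0]]
print(mae(pred, true), mape(pred, true))  # 2.5 and about 8.75

# Toy stand-in for a real prediction experiment (purely illustrative):
# the error shrinks as K grows and is smallest at alpha = 0.3.
def toy_evaluate(delta, k, alpha):
    error = abs(alpha - 0.3) + 1.0 / k
    return (error, error)

ks = range(5, 125, 5)
alphas = [i / 10 for i in range(11)]
best_params, best_score = grid_search(toy_evaluate, [3, 4, 5], ks, alphas)
print(best_params)
```

In a real experiment `evaluate` would run the KNN model on held-out samples with the given parameters and return the two indexes.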

The method further comprises step three: dynamically optimizing the parameter α based on the DDPG algorithm, specifically as follows:

The following definitions are made:

State S_t: comprises the observed external road network traffic state V_t and the prediction model's own state P_t, i.e., S_t = {V_t, P_t}, where V_t is the state unit observed at time t and P_t is the residual of the most recent prediction that can be evaluated at time t;

Action a_t: the value of the parameter α selected in the decision. Unlike in the coarse calibration of the KNN model parameters, α here is a continuous value in [0, 1] and is not discretized;

Immediate reward r_t: with the coarsely calibrated static KNN prediction model as the baseline, the average improvement rate of the evaluation indexes after executing action a_t is defined as the reward function. When the indexes obtained by executing action a_t are smaller than those obtained by the coarsely calibrated model, action a_t is regarded as an effective optimization; otherwise it is regarded as an ineffective optimization. When the indexes obtained by action a_t are greater than those of the coarsely calibrated model but exceed them by no more than 1% of their value, the action is regarded as not fully effective. Accordingly, the reward function is defined as:

where MAE_t and MAE′_t are the mean absolute errors obtained by the static coarsely calibrated model and by the model using the selected action a_t, respectively, when predicting the state unit X_t, and likewise MAPE_t and MAPE′_t are the mean absolute percentage errors. Thus r_t is positive when a_t is effective, negative when a_t is ineffective, and 0 when a_t is not fully effective.
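A reward with the sign behaviour described above might look like the following sketch. The exact reward formula is not available in this text, so the form used here is an assumption: the reward is the mean of the relative improvements in MAE and MAPE over the coarsely calibrated baseline, and it is clamped to 0 when the model is worse by at most the 1% tolerance. The function names and the `tolerance` parameter are hypothetical.

```python
def improvement_rate(baseline, new):
    """Relative improvement of an error index; positive when `new` is smaller."""
    return (baseline - new) / baseline

def reward(mae_base, mae_new, mape_base, mape_new, tolerance=0.01):
    """Average improvement rate, with a 1% tolerance band mapped to zero.

    Assumed form: the source only fixes the sign of the reward in each case.
    """
    rate = 0.5 * (improvement_rate(mae_base, mae_new)
                  + improvement_rate(mape_base, mape_new))
    if -tolerance <= rate <= 0.0:
        return 0.0   # not fully effective: slightly worse, but within tolerance
    return rate      # positive if effective, negative if ineffective

r_effective = reward(2.0, 1.8, 10.0, 9.0)      # both indexes improve by 10%
r_tolerated = reward(2.0, 2.01, 10.0, 10.05)   # both indexes worsen by 0.5%
r_ineffective = reward(2.0, 2.4, 10.0, 12.0)   # both indexes worsen by 20%
print(r_effective > 0, r_tolerated == 0.0, r_ineffective < 0)
```

How to treat a case where one index improves and the other worsens is not specified in the text; averaging the two rates is one possible choice.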

The DDPG algorithm training flow is as follows:

(1) Initializing parameters

DDPG adopts an Actor-Critic architecture: the Actor is responsible for outputting actions, interacting with the environment, and learning the policy, while the Critic is responsible for evaluating actions and improving the policy. The functions of both the Actor and the Critic are implemented by neural networks;

The original Actor and Critic networks are regarded as evaluation networks (online networks), and copies with the same network structure are set up as target networks. The evaluation networks update their parameter values at every interaction with the environment, while the target networks copy the parameter values of the evaluation networks at specified intervals;

The evaluation network parameters of the Actor and the Critic are initialized and denoted θ^μ and θ^Q, where θ^μ are the evaluation network parameters of the Actor and θ^Q those of the Critic, and are copied to the target networks, i.e., θ^Q → θ^Q′ and θ^μ → θ^μ′;

(2) Experience collection

First, using the historical data and its time axis, a time point is randomly selected at which to start prediction;

Denote the prediction time step by i, where i is an arbitrary index. In the time-varying prediction process, record the traffic state V_i at that time step; compute the residual from the most recent prediction result that can be evaluated and take it as the prediction model state P_i; the parameter α used in the KNN prediction model is the action value a_i decided by the agent at that time step; evaluate the prediction effect with the reward function to obtain r_i; observe the state S_{i+1} at the next time step i+1; and store (S_i, a_i, r_i, S_{i+1}) as one record in the memory pool for later use;

The prediction process is repeated along the time axis until T predictions are completed, which constitutes one training round;

(3) Experience replay

Given a threshold on the number of sample records in the memory pool, once the number of records exceeds the threshold, a batch is randomly sampled from the memory pool to train the evaluation networks, and the evaluation network parameters are updated by gradient computation and backpropagation. With the gradient taken with respect to the network parameter θ, the update is:

where the gradients are those with respect to the network parameters θ^Q and θ^μ;

At specified intervals of time steps, the target network parameters are updated by a soft update, specifically:

(4) The specified number of rounds is reached and the training is ended.
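The soft update of the target networks in the training flow above can be illustrated as follows. The update formula is not available in this text, so the sketch assumes the rule commonly used with DDPG, θ′ ← τ·θ + (1 − τ)·θ′. The mixing coefficient `tau` and the flat-list representation of network parameters are assumptions made for the example.

```python
def soft_update(target, online, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]

online = [1.0, -2.0, 0.5]   # evaluation (online) network parameters
target = [0.0, 0.0, 0.0]    # target network parameters

# Three consecutive soft updates with a deliberately large tau for visibility.
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
print(target)  # [0.875, -1.75, 0.4375]
```

With a small `tau` the target networks trail the evaluation networks slowly, which is what stabilizes the target values used to train the Critic.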

The deep neural networks in the DDPG algorithm are designed as follows:

(1) The Actor network has two input interfaces, for the road network traffic state V_t and the prediction model state P_t, respectively. The Critic network takes the same two inputs and additionally the action value, i.e., the output of the Actor network;

(2) The output of the Actor network is the action value, i.e., the parameter α of the prediction model; its output dimension is 1, and given the value range of α the output layer uses a sigmoid activation function. The output of the Critic network is the Q value; its output dimension is 1, and because its value range has no definite bound the output layer uses a linear activation function;

(3) The Critic network updates its parameters by minimizing a loss function, i.e., by gradient descent. The loss function takes the form of a mean square error, specifically

where Q(s_i, a_i | θ^Q) is the output value of the evaluation (online) network and q_i is the target value, calculated from the state transition as

q_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})    (14)

where Q′ is the target Q network and μ′ is the target policy network;

The Actor network is updated based on the policy gradient, i.e.

where the left-hand side is the loss gradient, ∇ denotes the gradient operator, and Q(s, a | θ^Q) is the output of the online Q function under network parameters θ^Q, with s = s_i and a = μ(s_i).
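The network roles in items (1) to (3) can be made concrete with a deliberately tiny example. It is only a sketch: real Actor and Critic networks would be multi-layer neural networks, whereas here each is a single linear layer so that the arithmetic can be followed by hand. The function names, the two-component toy state, and all numeric values are assumptions; only the sigmoid output, the linear Q output, the target value of formula (14), and the mean-square-error form of the loss come from the text.

```python
import math

def actor(state, theta_mu):
    """Policy mu(s | theta_mu): the sigmoid output keeps alpha inside (0, 1)."""
    z = sum(w * x for w, x in zip(theta_mu, state))
    return 1.0 / (1.0 + math.exp(-z))

def critic(state, action, theta_q):
    """Q(s, a | theta_q): linear output, so the Q value is unbounded."""
    features = list(state) + [action]
    return sum(w * x for w, x in zip(theta_q, features))

def critic_loss(batch, theta_q, theta_q_target, theta_mu_target, gamma):
    """Mean squared error between the targets q_i of formula (14) and online Q values."""
    total = 0.0
    for s, a, r, s_next in batch:
        a_next = actor(s_next, theta_mu_target)                        # mu'(s_{i+1})
        q_target = r + gamma * critic(s_next, a_next, theta_q_target)  # q_i
        total += (q_target - critic(s, a, theta_q)) ** 2
    return total / len(batch)

theta_mu_target = [0.0, 0.0]      # target Actor outputs 0.5 for every state
theta_q = [1.0, 0.0, 2.0]         # online Critic weights for (s[0], s[1], a)
theta_q_target = [1.0, 0.0, 2.0]  # target Critic starts as a copy

# One stored transition (S_i, a_i, r_i, S_{i+1}) with toy two-component states.
batch = [((1.0, 0.0), 0.5, 0.25, (2.0, 0.0))]
loss = critic_loss(batch, theta_q, theta_q_target, theta_mu_target, gamma=0.5)
print(loss)  # 0.0625
```

In this transition the target value is 0.25 + 0.5 × (2.0 + 2.0 × 0.5) = 1.75, the online Critic outputs 2.0, and the squared error is 0.0625. The Actor update by policy gradient is not shown, since it requires differentiating through both networks.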

The method further comprises step four: prediction experiment verification

During training, the effectiveness of the model is monitored by observing the single-prediction reward or the per-round cumulative reward; after training is completed, the relative quality of the models can be verified directly by calculating and comparing the two evaluation indexes.

Compared with the prior art, the invention has the following advantages:

(1) In the KNN prediction model, a vector-form expression of the short-time traffic state is adopted and a state distance measure that fuses the Euclidean distance and the cosine distance is introduced, so that the model is more flexible in handling abrupt traffic state changes as well as regular and irregular traffic evolution scenarios.

(2) Through the dynamic optimization of the DDPG algorithm, the short-time traffic state prediction model changes from a simple static prediction that merely rolls forward in time into one that learns the evolution of the traffic state and adjusts its predictions dynamically, which further improves prediction accuracy.

Drawings

FIG. 1 is the model construction flow of the dynamic short-time road network traffic state prediction model and prediction method according to the invention.

FIG. 2 shows the main model structure of the agent and the Actor-Critic in the DDPG algorithm of the invention.

FIG. 3 shows the variation of the per-round cumulative reward during training of the DDPG algorithm of the invention.

Detailed Description

The invention is further described below with reference to the drawings and the detailed description.

As shown in fig. 1, the invention provides a dynamic short-time road network traffic state prediction model optimized with the deep deterministic policy gradient (DDPG) algorithm, and a method for predicting the dynamic short-time road network traffic state using the model. The model mainly comprises a KNN-based static prediction model and a DDPG-based dynamic optimization part. The prediction method is implemented in four steps: data collection and processing, construction of the KNN-based static prediction model, construction and training of the DDPG-based dynamic optimization part, and prediction experiment verification, as follows:

step one: data collection and processing

The invention uses floating car data. In this example the data come from the on-board GPS devices of taxis in the Beijing Workers' Stadium area in 2015; the devices upload the real time, vehicle position, and instantaneous speed to the upper-level system every 2 minutes. For a given known road network, the collected data can be used to characterize the traffic state of the network at each time. Taking vehicle speed as an example, for time t the average speed v(t) of vehicles on a given road section l can be calculated. The average speeds on the road network composed of road sections l_1, l_2, …, l_n (n is the number of road sections; n = 257 in this embodiment) form the set V_t = (v_1(t), v_2(t), …, v_n(t)), and this set of speed values V_t is called the road network traffic state at time t. Because short-time traffic state changes are continuous, the observation at a single time can hardly express the trend of the change, so the invention aggregates the δ most recent observations up to time t into a space-time matrix to represent the traffic state at time t, as shown in formula (1):

where X_t is the space-time matrix formed by aggregating the δ most recent observations up to time t, v(t) is the average speed of vehicles on a given road section l at time t, and n is the number of road sections.

To distinguish it from the traffic state point V_t at time t, X_t is called the traffic state sequence at time t. For X_t, the future state point at time (t+1) is denoted y_t, and (X_t, y_t) forms a sample pair. Short-time traffic state prediction in the invention means predicting y_t on the basis of X_t.

In the course of the research it was found that the space-time matrix fully characterizes the road network traffic state at each time step but has the following problems. On the one hand, the evolution of the state over time is not sufficiently prominent, and the multi-dimensional time-series arrangement can blur some information and cause misjudgment; for example, in four-dimensional space the Euclidean distances from the origin to (1, 2, 3, 4) and to (4, 3, 2, 1) are the same, although one sequence rises and the other falls. On the other hand, aggregating the data of several time steps raises the overall data dimension, which consumes more computing resources and may even trigger the curse of dimensionality, reducing the accuracy of the result.
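The ambiguity in the four-dimensional example can be checked numerically. The sketch below uses only the two points named in the text; the helper names are illustrative, and the cosine distance is taken as one minus the cosine similarity, which is a common convention and an assumption here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

origin = (0, 0, 0, 0)
rising = (1, 2, 3, 4)    # speeds increasing over four time steps
falling = (4, 3, 2, 1)   # speeds decreasing over four time steps

d_rising = euclidean(origin, rising)
d_falling = euclidean(origin, falling)
print(d_rising == d_falling)              # True: both equal sqrt(30)
print(cosine_distance(rising, falling))   # about 0.333: the directions differ
```

The Euclidean distance cannot tell the two sequences apart, while a direction-sensitive measure can, which is the motivation for adding a trend component to the state representation.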

The invention therefore further processes X_t and represents it by a single reference state point V_t together with the trend vector that leads to V_t. Defining the traffic state in this vector form concisely characterizes both the evolution of the traffic state and its result; the resulting representation is called a traffic state unit, i.e.

where

Step two: static prediction model construction based on KNN

The KNN-based static short-time prediction model used in the invention computes the similarity between the sample to be predicted and the known samples and selects the K most similar samples, called neighbors; the label of the sample to be predicted is then inferred from the labels of the neighbors, which completes the prediction. The construction steps are as follows:

(1) Defining a similarity metric function

The similarity of two samples is usually measured by their distance: the smaller the distance, the higher the similarity, and vice versa. Because a traffic state unit consists of two forms of data, the invention fuses two distance measures: the Euclidean distance is used to measure the distance ED_i between reference state points, and the cosine distance is used to measure the distance CD_i between trend vectors, i.e.

where V_{h,i} is the i-th known reference state point and the accompanying trend vector is the i-th known trend vector; symbols with the subscript h denote known historical data in the collected samples. From these, a state distance SD_i is constructed to measure the similarity of state units:

where u = 1, 2, …, M, M is the number of historical samples, and α is the coefficient that balances the Euclidean distance and the cosine distance, with value range [0, 1].

As formula (5) and the definitions above show, the value of α determines how the state distance leans toward the Euclidean distance or the cosine distance. When α → 0, SD_i ≈ CD_i, meaning that the evolution trend of the state is decisive in the similarity measure of state units; this is usually appropriate when the features of the traffic evolution trend are very pronounced, for example when the traffic state changes abruptly within a short time or remains almost constant. When α → 1, SD_i ≈ ED_i, meaning that whether state units are similar depends on the reference state point, i.e., the final result, while the evolution process matters little; this usually corresponds to more regular traffic evolution.
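The effect of α can be illustrated with a small sketch. Formula (5) is not available in this text, so the combination used below is an assumption: each distance is divided by its maximum over the historical samples to bring the two onto a comparable scale, and the results are mixed as α·ED + (1 − α)·CD. This reproduces the limiting behaviour described above, but the normalization actually used in the patent may differ. All names and numbers are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norms == 0.0:
        return 0.0  # assumption: a zero trend vector is treated as "no difference"
    return 1.0 - dot / norms

def state_distances(ref, trend, hist_refs, hist_trends, alpha):
    """Assumed state distance: alpha * ED + (1 - alpha) * CD, each max-normalized."""
    ed = [euclidean(ref, h) for h in hist_refs]
    cd = [cosine_distance(trend, h) for h in hist_trends]
    ed_max = max(ed) or 1.0  # avoid dividing by zero when all distances are 0
    cd_max = max(cd) or 1.0
    return [alpha * e / ed_max + (1.0 - alpha) * c / cd_max
            for e, c in zip(ed, cd)]

ref, trend = [40.0, 30.0], [1.0, -1.0]
hist_refs = [[40.0, 30.0], [20.0, 10.0]]   # sample 0 shares the reference point
hist_trends = [[-1.0, 1.0], [2.0, -2.0]]   # sample 1 shares the trend direction

sd_euclid = state_distances(ref, trend, hist_refs, hist_trends, alpha=1.0)
sd_cosine = state_distances(ref, trend, hist_refs, hist_trends, alpha=0.0)
print(sd_euclid.index(min(sd_euclid)))  # 0: nearest by reference state point
print(sd_cosine.index(min(sd_cosine)))  # 1: nearest by evolution trend
```

The same query picks a different nearest neighbor at the two extremes of α, which is why a single statically calibrated α cannot suit both regular and abrupt traffic evolution.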

(2) Selecting K neighbors according to the similarity measurement result.

The state distances between the sample to be predicted and all known historical samples are calculated, and the K historical samples V_{h,1}, V_{h,2}, …, V_{h,K} with the smallest distances are taken as neighbors.

(3) Calculating the predicted value of the sample to be predicted.

The invention calculates the predicted value by incremental prediction, applying Gaussian weighting to the label values of the neighbors according to their distance.

First, the increment between y_{h,j} and V_{h,j} of a neighbor is defined; for the j-th nearest neighbor (re-indexed with subscript j = 1, 2, …, K),

Δy_{h,j} = y_{h,j} − V_{h,j}    (6)

i.e., the change in traffic state within the prediction window.

Then the predicted value is calculated through Gaussian weighting as

where, in view of the range of the state distance values and for simplicity, the weights can be set as
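Neighbor selection and incremental prediction can be sketched together. The prediction and weight formulas are not available in this text, so two assumptions are made and should be read as such: the weight of neighbor j is a Gaussian kernel exp(−SD_j² / (2σ²)) with a hypothetical bandwidth `sigma`, and the prediction is the current reference state V_t plus the weight-normalized average of the neighbor increments of formula (6).

```python
import math

def knn_increment_predict(v_now, sd, hist_refs, hist_labels, k, sigma=1.0):
    """Predict y_t as V_t plus a Gaussian-weighted mean of neighbor increments.

    sd[i] is the state distance between the query and historical sample i.
    """
    # Indices of the K historical samples with the smallest state distance.
    nearest = sorted(range(len(sd)), key=lambda i: sd[i])[:k]
    # Assumed Gaussian kernel: closer neighbors receive larger weights.
    weights = [math.exp(-sd[i] ** 2 / (2.0 * sigma ** 2)) for i in nearest]
    total = sum(weights)
    increment = [0.0] * len(v_now)
    for w, i in zip(weights, nearest):
        for l in range(len(v_now)):
            # Formula (6): increment of neighbor i on road section l.
            increment[l] += w * (hist_labels[i][l] - hist_refs[i][l])
    return [v + d / total for v, d in zip(v_now, increment)]

v_now = [40.0, 30.0]                                   # current reference state V_t
sd = [0.0, 0.0, 5.0]                                   # toy state distances
hist_refs = [[38.0, 29.0], [42.0, 31.0], [10.0, 10.0]]
hist_labels = [[40.0, 30.0], [42.0, 33.0], [50.0, 50.0]]

pred = knn_increment_predict(v_now, sd, hist_refs, hist_labels, k=2)
print(pred)  # [41.0, 31.5]
```

The two selected neighbors have increments (2, 1) and (0, 2) with equal weights, so the averaged increment is (1, 1.5). Predicting increments rather than raw labels keeps the forecast anchored to the current state even when the neighbors sit at different speed levels.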

(4) Coarse calibration of model parameters.

To make prediction possible, the undetermined parameters of the prediction model, namely δ, K, and α, must also be calibrated; this is done by a grid search experiment on real data. In this example, K is searched from 5 to 120 in steps of 5, and α from 0 to 1 in steps of 0.1.

First, an evaluation system for the prediction effect is established, comprising the mean absolute error (MAE) and the mean absolute percentage error (MAPE), i.e.

where N is the number of samples to be predicted in the experiment, and the remaining symbols denote the predicted value and the true value of the i-th sample to be predicted, respectively.

Second, the value ranges of the parameters to be calibrated are discretized, the different parameter combinations are tried one by one in experiments, and the results are recorded.

Finally, the parameter combination with the best experimental result is selected as the calibrated values of the model parameters. The calibrated values for different prediction steps in this example are as follows:

Step three: construction and training of the dynamic optimization part based on the DDPG algorithm.

After step two, the KNN-based prediction model can complete the expected prediction task. However, given the parameter calibration method, the model can only perform static prediction. A static model ignores the objective changes in short-time traffic flow and their effects: the parameters K and α, which are closely related to the time-varying traffic state, do not adapt during prediction. The method therefore goes on to construct and train a dynamic optimization part based on the DDPG algorithm. The method protected by the invention is not limited to the approach adopted in this embodiment, and a person skilled in the art may construct and train the dynamic optimization part in other feasible ways; this embodiment adopts what is currently a more reasonable optimization method, described in detail below.

Calibration experiments show that the Gaussian weighting method effectively reduces the sensitivity of the prediction effect to the value of K: the prediction error almost always decreases gradually as K increases. In other words, as far as K is concerned, improving accuracy does not require dynamic adjustment; it suffices to choose a larger value. For the parameter α, however, a static calibration runs counter to the original purpose of introducing it, namely making the model compatible with both regular and irregular changes in short-time traffic flow by adjusting α; a single overall calibration lacks the necessary flexibility and adaptability.

The invention therefore provides a method for dynamically optimizing the parameter α through a deep reinforcement learning algorithm. Reinforcement learning is a class of machine learning algorithms that takes the Markov decision process (MDP) as its basic modeling idea and mainly comprises elements such as the state S, the action a, and the reward r. An agent built with a reinforcement learning algorithm completes a sequential decision process through a series of interactions with the environment and, through continual self-learning, comes to take the actions that maximize the reward in the different states of the environment. The deep deterministic policy gradient (DDPG) algorithm is a reinforcement learning algorithm that combines deep learning with the Actor-Critic architecture. It can handle continuous state spaces and continuous action spaces and is better suited to real problems that require high-dimensional continuous actions, which improves the dynamic adaptability of the model to complex environments and enables dynamic continuous decision-making. The specific steps are as follows:

1. Defining elements such as the state S, the action a, and the reward r.

First, the problem is defined explicitly: the dynamic optimization process is modeled as a Markov decision process. In the rolling prediction process of the model, choosing the value of the parameter α for each prediction of the KNN-based prediction model is regarded as one Markov decision. The decision does not depend on the values chosen in past predictions but only on the currently observed road network traffic state and the prediction effect of the prediction model, so the Markov property is satisfied.

Based on this modeling, the following definitions are made:

State S_t: comprises the observed external road network traffic state V_t and the prediction model's own state P_t, i.e., S_t = {V_t, P_t}, where V_t is the state unit observed at time t and P_t is the residual of the most recent prediction that can be evaluated at time t.

(Immediate) reward r_t: with the coarsely calibrated static KNN prediction model as the baseline, the average improvement rate of the evaluation indexes after executing action a_t is defined as the reward function. Before giving this function, a rule for qualitatively evaluating the prediction effect is defined: when the indexes obtained by executing action a_t are smaller than those obtained by the coarsely calibrated model, action a_t is called an effective optimization; otherwise, action a_t is called an ineffective optimization. To accelerate convergence of the algorithm, the invention allows a moderate tolerance: when the indexes obtained by action a_t are greater than those of the coarsely calibrated model but exceed them by no more than 1% of their value, the action is called not fully effective. Accordingly, the reward function is defined as:

where MAE_t and MAE′_t are the mean absolute errors obtained by the static coarsely calibrated model and by the model using the selected action a_t, respectively, when predicting the state unit X_t, and likewise MAPE_t and MAPE′_t are the mean absolute percentage errors. Thus r_t is positive when a_t is effective, negative when a_t is ineffective, and 0 when a_t is not fully effective. This means that when action a_t is performed

2. Designing the DDPG algorithm training flow.

Combining the application scenario of short-time traffic state prediction with the classical DDPG algorithm, the invention proposes the following training flow, as shown in fig. 2:

(1) Initializing parameters.

The agent in DDPG adopts an Actor-Critic architecture, which integrates value-based and policy-based methods. The Actor is responsible for outputting actions, interacting with the environment, and learning the policy, while the Critic is responsible for evaluating actions and improving the policy; the functions of both are implemented by neural networks. To improve the stability of the algorithm, the original Actor and Critic networks are regarded as evaluation networks (online networks), and copies with the same network structure are set up as target networks. The evaluation networks update their parameter values at every interaction with the environment, while the target networks copy the parameter values of the evaluation networks at specified intervals.

Therefore, the evaluation network parameters of the Actor and the Critic are initialized, denoted θ^μ and θ^Q respectively, and then copied to the target networks, i.e., θ^Q → θ^Q′ and θ^μ → θ^μ′.

(2) And (5) experience collection.

First, a time point is randomly selected to start prediction by using history data and a time axis thereof.

Recording the predicted time step as i, and recording the traffic state V _i of the predicted time step in the prediction process of time variation; calculating residual errors according to the prediction result which can be evaluated last time, and taking the residual errors as a prediction model state P _i; the parameter alpha used in the KNN prediction model is an action value a _i determined by the agent in the time step; evaluating the prediction effect according to the reward function to obtain r _i; in addition, the next time step, i.e., i+1, is observed for the state S _i+1. And storing (S _i,a_i,r_i,S_i+1) as a record in a memory pool for standby.

And continuously repeating the prediction process according to the time axis to finish T predictions, and recording the predictions as a training round. In this example, t=100 is taken.

(3) Experience playback.

As training proceeds, a relatively rich experience will be accumulated in the memory pool. Given a threshold delta, when the number of sample records in a memory pool exceeds delta, a batch training estimation network is randomly sampled from the memory pool, an estimation network parameter is updated according to a certain gradient calculation method and a counter-propagation rule, and the parameters of the estimation network are recordedThe gradient for the network parameter θ is updated as follows:

When the specified interval of time steps is reached, the target network parameters are updated by soft updating, specifically:
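In the standard DDPG formulation, a soft update blends the evaluation network parameters into the target network as θ' ← τθ + (1 − τ)θ'. The sketch below assumes that form; the symbol τ, its value, and the dictionary-of-lists parameter layout are illustrative assumptions and are not taken from the source.

```python
import copy

def hard_copy(online):
    """Initial copy theta -> theta': the target starts identical to the online net."""
    return copy.deepcopy(online)

def soft_update(target, online, tau):
    """In-place soft update: theta' <- tau * theta + (1 - tau) * theta'."""
    for name, values in online.items():
        target[name] = [tau * w + (1.0 - tau) * w_target
                        for w, w_target in zip(values, target[name])]

# Toy parameters; real networks would hold weight matrices per layer.
online = {"layer1": [1.0, 2.0], "layer2": [3.0]}
target = hard_copy(online)
online["layer1"] = [2.0, 4.0]          # pretend a gradient step changed the online net
soft_update(target, online, tau=0.5)   # tau = 0.5 only to make the arithmetic visible
```

In practice τ is usually chosen much smaller than 0.5, so that the target networks change slowly and the learning targets stay stable.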

(4) Training ends when the specified number of rounds M is reached. In this example, M = 4000.

Deep neural network design in the DDPG algorithm.

The set of deep neural networks consisting of the evaluation and target networks of the Actor and the Critic is the key component for learning the policy in the DDPG algorithm. Since each target network is a copy of its evaluation network, that is, its structure is identical to that of the evaluation network, only the evaluation networks of the Actor and the Critic need to be designed. Although automated machine learning (AutoML), and in particular neural architecture search (NAS), has recently received extensive attention and research and has made automated design of deep neural networks possible, certain manual rules and constraints should still be imposed on the design according to the specific functions and structures of the networks. Considering the purpose and setting in which the Actor and Critic networks are used, namely the network inputs and outputs, the error function, and so on, the invention imposes the following design requirements:

(1) The Actor network has two input interfaces, for the road network traffic state V_t and the prediction model state P_t respectively. In addition to the road network traffic state and the prediction model state, the Critic network also takes the action value, namely the output of the Actor network, as input.

(2) The output of the Actor network is the action value, namely the parameter alpha in the prediction model, so the output dimension is 1; given the value range of alpha, the output layer uses a sigmoid activation function. The output of the Critic network is the Q value, with output dimension 1; because its value range is not explicitly bounded, the output layer uses a linear activation function.
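Requirements (1) and (2) can be illustrated with a small forward-pass sketch. Everything beyond the input/output interfaces and the output activations is an assumption: the single hidden layer, its width, the ReLU activation, the weight initialization, and the choice to concatenate the two inputs rather than process them in separate branches.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization scheme assumed)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def actor_forward(params, v_t, p_t):
    """Actor: inputs V_t and P_t, output alpha in (0, 1) through a sigmoid."""
    hidden = relu(dense(v_t + p_t, *params["hidden"]))
    return sigmoid(dense(hidden, *params["out"])[0])

def critic_forward(params, v_t, p_t, action):
    """Critic: inputs V_t, P_t and the action, output an unbounded Q value."""
    hidden = relu(dense(v_t + p_t + [action], *params["hidden"]))
    return dense(hidden, *params["out"])[0]  # linear output layer

rng = random.Random(0)
n_v, n_p, n_hidden = 4, 2, 8  # illustrative sizes, not from the source
actor = {"hidden": init_layer(n_v + n_p, n_hidden, rng),
         "out": init_layer(n_hidden, 1, rng)}
critic = {"hidden": init_layer(n_v + n_p + 1, n_hidden, rng),
          "out": init_layer(n_hidden, 1, rng)}

v_t = [42.0, 38.5, 51.2, 47.9]  # made-up segment speeds
p_t = [1.3, -0.7]               # made-up prediction residuals
alpha = actor_forward(actor, v_t, p_t)
q_value = critic_forward(critic, v_t, p_t, alpha)
```

The sigmoid keeps the Actor output inside the admissible range of alpha without clipping, while the linear Critic output leaves the Q value free to take any real value.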

$$q_i = r_i + \gamma\, Q'\left(s_{i+1},\ \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right) \tag{14}$$
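Equation (14) is the learning target for the Critic: the immediate reward plus the discounted Q value that the target networks assign to the next state. A minimal sketch of that computation follows. The callables `target_actor` and `target_critic` are hypothetical stand-ins for μ′ and Q′, and the mean-squared-error loss is the usual DDPG choice, assumed here because the loss expression does not appear in this text.

```python
def td_targets(batch, target_actor, target_critic, gamma):
    """Compute q_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each record."""
    targets = []
    for _state, _action, reward, next_state in batch:
        next_action = target_actor(next_state)           # mu'(s_{i+1} | theta^mu')
        next_q = target_critic(next_state, next_action)  # Q'(s_{i+1}, . | theta^Q')
        targets.append(reward + gamma * next_q)
    return targets

def critic_loss(targets, predictions):
    """Mean squared error between targets q_i and online estimates Q(s_i, a_i)."""
    return sum((q - p) ** 2 for q, p in zip(targets, predictions)) / len(targets)

# Toy usage with placeholder target networks.
batch = [(0, 0.5, 1.0, 1), (1, 0.5, 2.0, 2)]  # (S_i, a_i, r_i, S_{i+1})
targets = td_targets(batch,
                     target_actor=lambda s: 0.5,
                     target_critic=lambda s, a: s + a,
                     gamma=0.5)
loss = critic_loss([1.0, 3.0], [0.0, 1.0])
```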

The Actor network is updated based on the policy gradient, i.e.
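For reference, the deterministic policy gradient in the standard DDPG formulation is usually estimated over a sampled batch as shown below. This is the textbook expression, given as an assumption about the form intended here; the batch size N is a symbol introduced for illustration, and the expression used in the original filing is not reproduced in this text.

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\Big|_{s = s_i,\ a = \mu(s_i)}
  \, \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\Big|_{s = s_i}
```

The first factor asks the Critic how the Q value changes with the action; the second propagates that change to the Actor parameters θ^μ, so the Actor moves toward actions the Critic rates more highly.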

Step four: prediction experiment verification.

It is crucial to verify that the training and the resulting model are effective. The mean absolute error (MAE) and the mean absolute percentage error (MAPE) form a comprehensive evaluation index system, and the reward function is defined as the average percentage improvement of the DDPG optimization over the KNN prediction model. Thus, during training, the effectiveness of the model can be monitored by observing the reward of a single prediction or the cumulative reward of a round; after training is completed, the relative quality of the model can be verified directly by calculating and comparing the two evaluation indexes.
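The two evaluation indexes can be computed as in the sketch below. The helper `improvement_percent` and the sample speeds are illustrative assumptions; they show how a baseline and an optimized model might be compared and are not the patent's reward formula.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def improvement_percent(baseline, optimized):
    """Relative reduction of an error index, in percent (hypothetical helper)."""
    return 100.0 * (baseline - optimized) / baseline

# Made-up speeds for two road sections.
actual = [50.0, 40.0]
knn_pred = [45.0, 44.0]    # static, coarsely calibrated model
ddpg_pred = [48.0, 42.0]   # model with dynamically chosen alpha

mae_gain = improvement_percent(mae(actual, knn_pred), mae(actual, ddpg_pred))
mape_gain = improvement_percent(mape(actual, knn_pred), mape(actual, ddpg_pred))
```

Both gains are positive in this toy case because the second set of predictions is closer to the observed speeds on both indexes.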

In this example, the monitored cumulative round rewards are shown in FIG. 3. The black trend line in the figure shows that at the start of training the agent obtains very low rewards, even negative ones, and that the rewards become positive after a few rounds. Although rounds with negative rewards still occur, the rewards become consistently positive as training progresses, and the trend line settles at a value of about 100.

The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any simple modification, variation and equivalent structural changes made to the above embodiment according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims

1. A dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient (DDPG) algorithm optimization, characterized by comprising the following steps:

Step one: data collection and processing

(2) Calculating the average vehicle speed v(t) on a given road section l at time t, wherein the average values at time t and at times before t are known,

(3) Using the above method, obtaining the corresponding set of average speed values V_t = (v₁(t), v₂(t), …, v_n(t)) on the road network formed by road sections l₁, l₂, …, l_n, where n is the number of road sections;

(4) Using the data collected at the delta-1 time points before time t, calculating and aggregating them into a space-time matrix X_t that represents the traffic state at time t, where X_t is as shown in (1):

(5) Processing X_t: for each single reference state point, calculating the trend vector relative to the reference point, and defining the traffic state unit X'_t in vector form, i.e.

wherein

Step two: construction of a KNN-based static prediction model

(2) Selecting K neighbors according to similarity measurement results

(3) Calculating a predicted value of a sample to be predicted

$$\Delta y_{h,j} = y_{h,j} - V_{h,j} \tag{6}$$

i.e., the change in traffic state within the prediction window;

Next, the predicted value at the future time (t+1) is calculated by Gaussian weighting as

wherein the weights

2. The method for predicting the traffic state of the dynamic short-time road network based on deep deterministic policy gradient algorithm optimization according to claim 1, wherein

step two, construction of the KNN-based static prediction model, further comprises:

(4) Coarse calibration of model parameters

3. The method for predicting the traffic state of the dynamic short-time road network based on deep deterministic policy gradient algorithm optimization according to claim 2, wherein

the method further comprises step three: dynamic optimization of the parameter alpha based on the DDPG algorithm, which specifically comprises:

The following definitions are made:

Action a_t: the value of the parameter alpha selected in the decision; unlike in the coarse calibration of the KNN model parameters, alpha takes continuous values in [0,1] and is not discretized,

Immediate reward r_t: taking the coarsely calibrated static KNN prediction model as the baseline, the average index improvement rate after executing action a_t is defined as the reward function. When the indexes obtained by executing action a_t are smaller than those obtained by the coarsely calibrated model, action a_t is regarded as an effective optimization; otherwise, action a_t is regarded as an ineffective optimization. When the indexes obtained by action a_t are greater than those obtained by the coarsely calibrated model but exceed them by no more than 1%, action a_t is regarded as not fully effective. Accordingly, the reward function r_t is defined as:

wherein MAE_t and MAE′_t are the mean absolute errors obtained by the static coarsely calibrated model and by the model with the selected action a_t, respectively, when predicting state unit X_t; similarly, MAPE_t and MAPE′_t are the mean absolute percentage errors obtained by the static coarsely calibrated model and by the model with the selected action a_t, respectively, when predicting state unit X_t. When a_t is effective, r_t is positive; when a_t is ineffective, r_t is negative; when a_t is not fully effective, r_t is 0.

4. The method for predicting the traffic state of the dynamic short-time road network based on deep deterministic policy gradient algorithm optimization according to claim 3, wherein

The DDPG algorithm training flow is as follows:

(1) Initializing parameters

(2) Experience collection

(3) Experience replay

gradients for network parameters θ^Q and θ^μ;

(4) The specified number of rounds is reached and the training is ended.

5. The method for predicting the traffic state of the dynamic short-time road network based on deep deterministic policy gradient algorithm optimization according to claim 4, wherein

The deep neural network design mode in DDPG algorithm is as follows:

(1) The Actor network has two input interfaces, for the road network traffic state V_t and the prediction model state P_t respectively; in addition to the road network traffic state and the prediction model state, the Critic network also takes the action value, namely the output of the Actor network, as input;

$$q_i = r_i + \gamma\, Q'\left(s_{i+1},\ \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right) \tag{14}$$

wherein Q′ is the target Q network and μ′ is the target policy network;

The Actor network is updated based on the policy gradient, i.e.

6. The method for predicting the traffic state of the dynamic short-time road network based on deep deterministic policy gradient algorithm optimization according to claim 5, further comprising:

Step four: prediction experiment verification

During training, the effectiveness of the model is monitored by observing the reward of a single prediction or the cumulative reward of a round; after training, the relative quality of the model is verified directly by calculating and comparing the two evaluation indexes.