CN115938104A - Dynamic short-time road network traffic state prediction model and prediction method - Google Patents
Info
- Publication number
- CN115938104A (application number CN202111115375.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Traffic Control Systems (AREA)
Abstract
The invention discloses a dynamic short-time road network traffic state prediction method optimized by the deep deterministic policy gradient (DDPG) algorithm. The method incorporates a vector-form short-time traffic state representation into a KNN prediction model, giving the model greater flexibility in handling both rapid and gradual traffic state changes and both conventional and unconventional traffic evolution. Through dynamic optimization with the DDPG algorithm, the short-time traffic state prediction model changes from simple static prediction over time into prediction that adapts to the evolving traffic state. This solves the prior-art problem that static and semi-static models can only fit historical data and patterns and cannot quickly adapt to sudden, random changes in the real-time traffic state, and thereby further improves prediction accuracy.
Description
Technical Field
The invention belongs to the field of traffic big data technology and applications, relates to short-time road network traffic state prediction, and particularly relates to a dynamic short-time road network traffic state prediction model optimized with a deep deterministic policy gradient algorithm.
Background
Short-time road network traffic state prediction plays an important practical role in intelligent transportation systems and future-oriented intelligent vehicle-road cooperative systems, and is a prerequisite for other real-time traffic services such as en-route path guidance and route decision-making. The quality of many basic traffic services therefore depends on the accuracy of short-time road network traffic state prediction.
In recent years, with the diversification of traffic detectors and improvements in data storage devices, research on traffic data acquisition technology and its applications has advanced greatly. Correspondingly, short-time road network traffic state prediction algorithms driven by traffic big data have emerged in large numbers. They mainly comprise shallow-learning (traditional machine learning) models, represented by K-nearest neighbors, support vector machines, and decision trees, and deep-learning models, represented by long short-term memory networks, convolutional neural networks, and their combinations.
However, both types of model are usually trained and constructed on abundant historical data: the main architecture and hyper-parameters of the model are fixed and are no longer adjusted, or are adjusted only at certain intervals, while the model is applied. Such "static" and "semi-static" models can fit historical data and patterns well, but cannot quickly adapt to sudden, random changes in the real-time traffic state and make timely adjustments.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a dynamic short-time road network traffic state prediction model and prediction method.
the complete technical scheme of the invention comprises the following steps:
a dynamic short-time network traffic state prediction method based on deep certainty strategy gradient algorithm optimization comprises the following steps:
the method comprises the following steps: data collection and processing
(1) Collect the time, position, and instantaneous vehicle speed information uploaded to a superior system by on-board GPS devices at a specified time interval.
(2) Compute the average vehicle speed $v(t)$ on a road section $l$ at time $t$; the averages at and before time $t$ are known values.
(3) For a road network formed by sections $l_1, l_2, \dots, l_n$, obtain the corresponding set of average speed values $V_t = (v_1(t), v_2(t), \dots, v_n(t))$, where $n$ is the number of road sections.
(4) Using the observations at time $t$ and the $\delta - 1$ moments before it, aggregate the observed values into a space-time matrix $X_t$; $X_t$ represents the traffic state at time $t$, as shown in formula (1).
(5) Process $X_t$: for each single reference state point, compute its trend vector, and define the traffic state unit $X'_t$ in vector form.
Here a reference state point refers to a known value of the space-time matrix $X_t$ at a particular time.
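As a concrete illustration of steps (4)-(5), the sketch below builds the space-time matrix $X_t$ and a vector-form state unit from per-segment speed observations. The function name and the trend-vector formula (a mean of successive differences inside the window) are assumptions for illustration, since the patent's equation images for $X_t$ and $X'_t$ are not reproduced here.

```python
import numpy as np

def state_unit(speeds, t, delta):
    """Build the vector-form traffic state unit from per-segment speeds.

    speeds : array of shape (n_segments, n_times), speeds[l, t] = v_l(t).
    Returns (V_t, trend): V_t is the reference state point at time t and
    `trend` is a trend vector summarizing the evolution inside the window.
    The trend formula (mean of successive differences) is an assumption;
    the patent gives the exact trend-vector equation only as an image.
    """
    X_t = speeds[:, t - delta + 1: t + 1]      # space-time matrix of formula (1)
    V_t = X_t[:, -1]                           # reference state point V_t
    trend = np.diff(X_t, axis=1).mean(axis=1)  # assumed trend vector
    return V_t, trend
```

With $\delta = 3$, for example, the state unit summarizes three consecutive observations per segment instead of carrying the full $\delta$-column matrix into the distance computation.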
Step two: construction of KNN-based static prediction model
(1) Measure the distance $ED_i$ between reference state points $V_t$ with the Euclidean distance, and the distance $CD_i$ between trend vectors with the cosine distance; the expressions are:
In the formulas, the $i$-th known reference state point and the $i$-th known trend vector carry the subscript $h$, denoting known historical data in the collected samples. From these, construct a state distance $SD_i$ for measuring the similarity of state units:
where $i = 1, 2, \dots, M$, $M$ is the number of historical samples, and $\alpha$ is a coefficient balancing the Euclidean and cosine distances, with value range $[0, 1]$;
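The fused state distance can be sketched as follows. The linear combination $SD_i = \alpha\,ED_i + (1-\alpha)\,CD_i$ is an assumption consistent with the limits stated later in the description ($\alpha \to 1$ gives $SD_i \approx ED_i$, $\alpha \to 0$ gives $SD_i \approx CD_i$); the patent's exact formula (5) appears only as an image.

```python
import numpy as np

def state_distance(V, trend, V_h, trend_h, alpha):
    """Similarity of two state units: Euclidean distance ED between the
    reference state points blended with cosine distance CD between the
    trend vectors.  SD = alpha*ED + (1-alpha)*CD is an assumed fusion
    consistent with the alpha -> 0 / alpha -> 1 limits in the text."""
    ed = np.linalg.norm(np.asarray(V, dtype=float) - np.asarray(V_h, dtype=float))
    cos = np.dot(trend, trend_h) / (np.linalg.norm(trend) * np.linalg.norm(trend_h))
    cd = 1.0 - cos                     # cosine distance from cosine similarity
    return alpha * ed + (1.0 - alpha) * cd
```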
(2) Selecting K neighbors according to similarity measurement results
Calculate the state distances between the sample to be predicted, $X_{t+1}$, and all known historical samples, and take the $K$ historical samples $V_{h,1}, V_{h,2}, \dots, V_{h,K}$ with the smallest distances as neighbors;
(3) Calculating a prediction value of a sample to be predicted
Compute the predicted value by incremental prediction: the neighbors' label values are Gaussian-weighted according to the magnitude of the state distance $SD_i$.
For $X_t$, denote the future state point $X_{t+1}$ at time $t+1$ as $y_t$.
The increment is the difference between a neighbor's $y_{h,j}$ and $V_{h,j}$; for the $j$-th neighbor ($j = 1, 2, \dots, K$):
$\Delta y_{h,j} = y_{h,j} - V_{h,j}$  (6)
i.e., the traffic state change within the prediction window;
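A minimal sketch of the incremental, Gaussian-weighted prediction in step (3); the weight form $w_j \propto \exp(-SD_j^2)$ is an assumption, since the weighting formula (7) is given only as an image in the patent.

```python
import numpy as np

def knn_increment_predict(V_t, neighbors, labels, distances):
    """Predict the next state as V_t plus the Gaussian-weighted mean of the
    neighbors' increments dy_{h,j} = y_{h,j} - V_{h,j} (formula (6)).
    Weights w_j = exp(-SD_j^2), normalized to sum to 1 (assumed form)."""
    neighbors = np.asarray(neighbors, dtype=float)
    labels = np.asarray(labels, dtype=float)
    d = np.asarray(distances, dtype=float)
    increments = labels - neighbors            # formula (6), one row per neighbor
    w = np.exp(-d ** 2)
    w /= w.sum()
    return np.asarray(V_t, dtype=float) + (w[:, None] * increments).sum(axis=0)
```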
Step two, in the construction of the KNN-based static prediction model, the method further comprises the following steps:
(4) Coarse calibration of model parameters
Calibrate the undetermined parameters $\delta$, $K$, and $\alpha$ in the KNN-based static prediction model with a gridded search experiment on the collected real data, specifically:
First, establish an evaluation system for the prediction effect, including the mean absolute error (MAE) and the mean absolute percentage error (MAPE):
where $n$ is the number of road sections, $N$ is the number of samples to be predicted in the experiment, and the predicted and true values of the $i$-th sample to be predicted are compared;
Second, discretize the value ranges of the parameters to be calibrated, run experiments with the different parameter combinations one by one, and record the results;
Finally, select the parameter combination with the best experimental result as the calibrated parameter values.
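The coarse calibration above can be sketched as an exhaustive grid search; `evaluate` is a placeholder for training the KNN model with one parameter combination and returning its error on held-out data.

```python
import itertools

def grid_search(evaluate, deltas, ks, alphas):
    """Try every (delta, K, alpha) combination, score it with the supplied
    evaluate(delta, K, alpha) -> error function, and keep the best one."""
    best_params, best_err = None, float("inf")
    for delta, k, alpha in itertools.product(deltas, ks, alphas):
        err = evaluate(delta, k, alpha)
        if err < best_err:
            best_params, best_err = (delta, k, alpha), err
    return best_params, best_err
```

In the embodiment described later, $K$ is searched over 5-120 in steps of 5 and $\alpha$ over 0-1 in steps of 0.1.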
The method further comprises step three: dynamically optimizing the parameter $\alpha$ based on the DDPG algorithm, specifically comprising:
the following definitions are made:
State $S_t$: comprises the observed external road network traffic state $V_t$ and the prediction model's own state $P_t$, i.e. $S_t = \{V_t, P_t\}$, where $V_t$ is the state unit observed at time $t$ and $P_t$ is the residual of the last prediction known to the prediction model at time $t$;
Action $a_t$: the value of parameter $\alpha$ selected in the decision. Unlike the coarse calibration of the KNN model parameters, $\alpha$ takes continuous values in $[0, 1]$ and is not discretized.
Immediate reward $r_t$: defined, with the help of the coarsely calibrated static KNN prediction model, as the average index improvement rate after executing action $a_t$. If the indices obtained when executing action $a_t$ are all smaller than those of the coarse-calibration model, taking action $a_t$ is called an effective optimization; otherwise it is called an ineffective optimization. If an index obtained when taking action $a_t$ is larger than that of the coarse-calibration model but by no more than 1% of its value, the action is called not completely effective. Accordingly, the reward function is defined as:
where $MAE_t$, $MAE'_t$ are the mean absolute errors obtained for state unit $X_t$ by the static coarse-calibration model and by the model with the selected action $a_t$, respectively, and likewise for the mean absolute percentage errors $MAPE_t$, $MAPE'_t$. When $a_t$ is effective, $r_t$ is positive; when $a_t$ is ineffective, $r_t$ is negative; when $a_t$ is not completely effective, $r_t$ is 0.
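A sketch of the reward rule, using the average relative improvement of MAE and MAPE over the coarse-calibration baseline as the "index improvement rate"; the patent's exact reward formula is given only as an image, so the gain expression below is an assumption that reproduces the stated sign behavior.

```python
def reward(mae, mape, mae_base, mape_base, tol=0.01):
    """r_t: positive when both indices beat the coarse-calibration baseline,
    0 when within the 1% tolerance ("not completely effective"), and the
    (typically negative) average improvement rate otherwise."""
    gain = 0.5 * ((mae_base - mae) / mae_base + (mape_base - mape) / mape_base)
    if mae < mae_base and mape < mape_base:
        return gain                            # effective optimization
    if mae <= mae_base * (1 + tol) and mape <= mape_base * (1 + tol):
        return 0.0                             # not completely effective
    return gain                                # ineffective optimization
```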
The DDPG algorithm training process comprises the following steps:
(1) Initialization parameters
DDPG adopts an Actor-Critic architecture: the Actor outputs actions, interacts with the environment, and learns the policy, while the Critic evaluates the actions and improves the policy; both are realized by neural networks.
The original Actor and Critic networks are regarded as estimation networks (online networks), and copies with the same structure are set up as target networks. The estimation networks update their parameter values at every interaction with the environment, while the target networks copy the estimation networks' parameter values at specified intervals.
First, initialize the estimation network parameters of the Actor and Critic, denoted $\theta^\mu$ and $\theta^Q$, where $\theta^\mu$ are the Actor (policy) estimation network parameters and $\theta^Q$ the Critic (Q) estimation network parameters, and copy them to the target networks, i.e. $\theta^Q \to \theta^{Q'}$ and $\theta^\mu \to \theta^{\mu'}$;
(2) Experience collection
First, randomly select a time point on the historical data's time axis at which to start prediction;
Denote the prediction time step as $i$, where $i$ is an arbitrary position index. During prediction over time, record the traffic state $V_i$ at the prediction time step; compute the residual of the last evaluable prediction as the prediction model state $P_i$; the parameter $\alpha$ used in the KNN prediction model is the action value $a_i$ decided by the agent at that time step; evaluate the prediction effect with the reward function to obtain $r_i$; observe the state $S_{i+1}$ at the next time step, i.e. at time $i+1$; store $(S_i, a_i, r_i, S_{i+1})$ as one record in the memory pool for later use;
Repeat the prediction process along the time axis until $T$ predictions are completed, which is recorded as one training round;
(3) Experience replay
Set a threshold on the number of sample records in the memory pool. When the number of records exceeds the threshold, randomly sample a batch from the pool to train the estimation networks, updating the estimation network parameters by gradient computation and back-propagation; writing $\nabla_\theta$ for the gradient with respect to the network parameters $\theta$, the update is:
At specified time-step intervals, the target network parameters are soft-updated, specifically:
(4) And finishing the training when the specified number of rounds is reached.
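Two building blocks of the training flow above can be sketched directly: the memory pool with its sampling threshold, and the soft update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ applied to the target networks. Representing networks as dicts of scalar parameters is purely illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory pool storing (S_i, a_i, r_i, S_{i+1}) records; training batches
    are drawn only once the record count exceeds the threshold."""
    def __init__(self, capacity=10000, threshold=64):
        self.buf = deque(maxlen=capacity)
        self.threshold = threshold
    def add(self, record):
        self.buf.append(record)
    def ready(self):
        return len(self.buf) > self.threshold
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

def soft_update(target, online, tau=0.005):
    """Target-network soft update: theta' <- tau*theta + (1 - tau)*theta'."""
    for name in online:
        target[name] = tau * online[name] + (1.0 - tau) * target[name]
```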
The deep neural network design mode in the DDPG algorithm is as follows:
(1) The Actor network has two input interfaces, used respectively to input the road network traffic state $V_t$ and the prediction model state $P_t$. In addition to these two inputs, the Critic network must also input an action value, i.e. the output of the Actor network;
(2) The Actor network's output is the action value, i.e. the parameter $\alpha$ in the prediction model, with output dimension 1; considering the value range of $\alpha$, the output layer uses a sigmoid activation function. The Critic network's output is a Q value, with output dimension 1; since its value range is not clearly bounded, the output layer uses a linear activation function;
(3) The Critic network updates its parameters by minimizing a loss function with gradient descent; the loss function takes the form of a mean squared error, specifically
where $Q(s_i, a_i | \theta^Q)$ is the output of the online Q network and $q_i$ is a target value computed from the state transition:
$q_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} | \theta^{\mu'}) | \theta^{Q'})$  (14)
where $Q'$ is the target Q network and $\mu'$ the target policy network;
The update of the Actor network is based on computing the policy gradient, i.e.
where $\nabla$ denotes the gradient operator and $Q(s, a | \theta^Q)$ is the online Q function evaluated at network parameters $\theta^Q$ with $s = s_i$, $a = \mu(s_i)$.
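The two update targets above reduce to small numeric computations once the network outputs are in hand; the sketch below shows the target value of formula (14) and the mean-squared-error critic loss of formula (13), with the Q values supplied as plain arrays (actual gradient propagation through the networks is omitted).

```python
import numpy as np

def critic_targets(rewards, next_q, gamma=0.99):
    """q_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1} | theta_mu') | theta_Q'),
    formula (14); next_q holds the target networks' Q' values."""
    return np.asarray(rewards, dtype=float) + gamma * np.asarray(next_q, dtype=float)

def critic_loss(q_online, targets):
    """Mean squared error between online Q outputs Q(s_i, a_i | theta_Q)
    and the targets q_i, formula (13), minimized by gradient descent."""
    q_online = np.asarray(q_online, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((targets - q_online) ** 2))
```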
The method further comprises step four: verification by prediction experiment.
During training, the validity of the model being trained is monitored by observing the single-step prediction rewards or the cumulative reward per round; after training, the relative quality of models can be verified directly by computing and comparing the two evaluation indices.
Compared with the prior art, the invention has the advantages that:
(1) The KNN prediction model incorporates a vector-form short-time traffic state representation and a state distance metric fusing Euclidean and cosine distances, giving the model greater flexibility in handling rapid traffic state changes and both conventional and unconventional traffic evolution.
(2) Through dynamic optimization with the DDPG algorithm, the short-time traffic state prediction model changes from simple static prediction over time into prediction that adapts to the evolving traffic state, further improving prediction accuracy.
Drawings
Fig. 1 is a flow of model construction for the dynamic short-time network traffic state prediction model and the prediction method of the present invention.
FIG. 2 is a main model structure formed by Agent and Actor-Critic in DDPG algorithm.
FIG. 3 is a diagram of the change in the per-round cumulative reward during training of the DDPG algorithm of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in FIG. 1, the invention provides a dynamic short-time road network traffic state prediction model optimized based on the deep deterministic policy gradient algorithm, and a method for predicting the dynamic short-time road network traffic state using the model. The model mainly comprises a KNN-based static prediction model and a DDPG-based dynamic optimization part. The prediction method is implemented in four steps: data collection and processing; construction of the KNN-based static prediction model; construction and training of the DDPG-based dynamic optimization part; and verification by prediction experiments. Specifically:
the method comprises the following steps: data collection and processing
The invention uses floating car data. In this example the data originate from taxi on-board GPS devices collected in the Beijing Workers' Stadium area in July 2015, which upload real-time timestamp, vehicle location, and instantaneous speed information to a superior system at 2-minute intervals. For a given known road network, the collected data can characterize the traffic state of the road network at each moment. Taking the vehicle speed as an example, for a moment $t$, the average speed $v(t)$ of vehicles on a road section $l$ can be computed; for a road network formed by sections $l_1, l_2, \dots, l_n$ ($n$ denotes the number of road sections; $n = 257$ in this embodiment), the corresponding set of average speed values is $V_t = (v_1(t), v_2(t), \dots, v_n(t))$, and this set of speed values $V_t$ is called the road network traffic state at time $t$. Because short-time traffic state changes are continuous, the trend of the change is often hard to express with the observation at a single moment, so the observations at the $\delta$ moments up to time $t$ are aggregated into a space-time matrix to express the traffic state at time $t$, as shown in formula (1):
where $X_t$ is the space-time matrix aggregating the observed values of the $\delta$ moments up to time $t$, $v(t)$ is the average speed on road section $l$ at time $t$, and $n$ denotes the number of road sections.
To distinguish the above concepts, $V_t$ is called the traffic state point at time $t$, and $X_t$ the traffic state sequence at time $t$. For $X_t$, denote the future state point at time $t+1$ as $y_t$; then $(X_t, y_t)$ is a sample pair. Short-time traffic state prediction in the present invention uses $X_t$ to predict $y_t$.
In research, the space-time matrix form fully represents the road network traffic state at each time step, but has the following problems. On one hand, the state evolution over time is not prominent enough, and the multi-dimensional time-series arrangement may obscure information and cause misjudgment; for example, the Euclidean distances from the origin to (1,2,3,4) and to (4,3,2,1) in four-dimensional space are the same. On the other hand, by aggregating data from multiple time steps, the space-time matrix multiplies the overall data dimension, consuming more resources in computation and even inducing the curse of dimensionality, which degrades result accuracy.
Therefore, the present invention further processes $X_t$: instead, from a single reference state point $V_t$ and the trend vector that produced $V_t$, the traffic state is defined in vector form, concisely depicting the evolution of the traffic state and its result; this is called a traffic state unit, i.e.
Step two: KNN-based static prediction model construction
The static short-time prediction model based on KNN mainly selects K most similar samples called as neighbors by calculating the similarity between the samples to be predicted and known samples; and then reasonably speculating the label of the sample to be predicted by using the adjacent label, namely completing the prediction. The method specifically comprises the following construction steps:
(1) Defining a similarity metric function
For two samples, their similarity is often measured by computing their distance: the smaller the distance, the higher the similarity; conversely, the larger the distance, the lower the similarity. For the traffic state unit, formed from two kinds of data, the invention defines a method fusing two distance metrics: the Euclidean distance measures the distance $ED_i$ between reference state points, and the cosine distance measures the distance $CD_i$ between trend vectors, i.e.
In the formulas, the $i$-th known reference state point and the $i$-th known trend vector carry the subscript $h$, denoting known historical data in the collected samples. From these, a state distance $SD_i$ for measuring the similarity of state units is constructed:
where $i = 1, 2, \dots, M$, $M$ is the number of historical samples, and $\alpha$ is a coefficient balancing the Euclidean and cosine distances, with value range $[0, 1]$.
From equation (5) and the above definitions, the value of $\alpha$ determines how the state distance leans toward the Euclidean or the cosine distance. When $\alpha \to 0$, $SD_i \approx CD_i$: the evolution trend of the state becomes the decisive factor in the similarity measurement, which usually suits cases where the traffic evolution trend is highly distinctive, such as a short-time abrupt change or a nearly constant traffic state. When $\alpha \to 1$, $SD_i \approx ED_i$: whether state units are similar depends more on the reference state point of the final result, and the evolution process matters less; this can generally be understood as the case where the traffic evolution trend is more conventional.
(2) And selecting K neighbors according to the similarity measurement result.
Calculate the state distances between the sample to be predicted and all known historical samples, and take the $K$ historical samples $V_{h,1}, V_{h,2}, \dots, V_{h,K}$ with the smallest distances as neighbors.
(3) And calculating the predicted value of the sample to be predicted.
The invention computes the predicted value by incremental prediction and Gaussian-weights the neighbors' label values according to distance.
First, the increment is defined as the difference between a neighbor's $y_{h,j}$ and $V_{h,j}$; for the $j$-th neighbor (reordering subscripts, $j = 1, 2, \dots, K$):
$\Delta y_{h,j} = y_{h,j} - V_{h,j}$  (6)
i.e., the amount of traffic state change within the prediction window.
Second, the predicted value is computed as
where, in view of the magnitude of the state distance values and for simplicity, the weights can be set as
(4) And (5) roughly calibrating model parameters.
To realize the prediction function, the undetermined parameters in the prediction model, $\delta$, $K$, and $\alpha$, are calibrated with a gridded search experiment on real data. In this example the search interval for $K$ is 5-120 with spacing 5, and for $\alpha$ it is 0-1 with spacing 0.1.
First, an evaluation system for the prediction effect is established, including the mean absolute error (MAE) and the mean absolute percentage error (MAPE), i.e.
where $N$ is the number of samples to be predicted in the experiment, and the predicted and true values of the $i$-th sample to be predicted are compared.
Then the value ranges of the parameters to be calibrated are discretized, experiments are run with the different parameter combinations one by one, and the results are recorded.
Finally, the parameter combination with the best experimental result is selected as the calibrated parameter values. The calibration values at different prediction steps in this example are as follows:
step three: and constructing and training a dynamic optimization part based on the DDPG algorithm.
After step two, the KNN-based prediction model can already complete the expected prediction task. In terms of the parameter calibration method, however, the model can only perform static prediction, and a static model ignores the influence of objective changes in short-time traffic flow: the parameters $K$ and $\alpha$, which are closely tied to the time-varying traffic state, no longer change accordingly during prediction. Therefore, the invention further constructs and trains a dynamic optimization part based on the DDPG algorithm. The method protected by the invention is not limited to the approach adopted in this embodiment; a person skilled in the art may use other feasible methods to construct and train the dynamic optimization part, but the approach adopted in this embodiment is currently a reasonable optimization method, and is described specifically below.
Calibration experiments show that the Gaussian weighting method effectively suppresses the sensitivity of the result to the value of $K$: the prediction error almost always decreases gradually as $K$ increases. In other words, to improve accuracy by dynamically adjusting the prediction model's parameters, $K$ merely needs to be given a sufficiently large value. For the parameter $\alpha$, however, static calibration runs contrary to its original purpose: $\alpha$ is meant to let the model accommodate both regular and irregular changes in short-time traffic flow, and an ordinary calibration method lacks the required flexibility and adaptability.
Therefore, the invention proposes dynamically optimizing the parameter $\alpha$ with a deep reinforcement learning algorithm. Reinforcement learning is a class of machine learning algorithms that takes the Markov Decision Process (MDP) as its basic modeling idea and mainly involves the elements state $S$, action $a$, and reward $r$. An agent constructed by a reinforcement learning algorithm completes a sequential decision process through a series of interactions with the environment; through continuous self-learning, the agent learns to take the reward-maximizing action for each state it faces in the environment. The Deep Deterministic Policy Gradient (DDPG) algorithm is a reinforcement learning algorithm combining deep learning with the Actor-Critic architecture. It can handle continuous state spaces and continuous action spaces and is well suited to practical problems requiring high-dimensional continuous action outputs, improving the model's dynamic adaptability to complex environments and enabling dynamic continuous decision-making. The method specifically comprises the following steps:
1. Define the state $S$, action $a$, and reward $r$.
First, the problem definition is made explicit: the dynamic optimization process is modeled as a Markov decision process. During the model's rolling prediction, the choice of the parameter $\alpha$ in each KNN-based prediction is regarded as a Markov decision: the decided value does not depend on decisions in past predictions, but only on the currently observed road network traffic state and the prediction model's prediction effect, satisfying the Markov property.
From the modeling described above, the following definitions can be made:
State S_t: comprises the observed external road network traffic state V_t and the prediction model's own state P_t, i.e. S_t = {V_t, P_t}, where V_t is the state unit observed at time t and P_t is the residual of the most recent prediction known to the model at time t.
Action a_t: the value of the parameter α selected in the decision. Unlike the coarse calibration of the KNN model parameters, α takes continuous values in [0, 1] and is not discretized.
(Immediate) reward r_t: the average index improvement rate obtained by executing action a_t, relative to the coarsely calibrated static KNN prediction model, is used as the reward function. Before giving the function, a rule for qualitatively evaluating the prediction effect is defined: if the indices obtained by taking action a_t are all smaller than those of the coarsely calibrated model, a_t is called an effective optimization; otherwise, a_t is called an ineffective optimization. To speed up convergence, the invention tolerates small degradations: if an index obtained by taking action a_t exceeds the corresponding index of the coarsely calibrated model by no more than 1% of its value, a_t is called not completely effective. Accordingly, the reward function is defined as:
where MAE_t and MAE′_t are the mean absolute errors obtained for state unit X_t by the static coarsely calibrated model and by the model taking action a_t, respectively, and MAPE_t and MAPE′_t are the corresponding mean absolute percentage errors. When a_t is effective, r_t is positive; when a_t is ineffective, r_t is negative; when a_t is not completely effective, r_t is 0. The agent is therefore rewarded in proportion to the improvement that action a_t delivers over the static model.
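A minimal sketch of such a reward, assuming the "average index improvement rate" is the mean of the relative MAE and MAPE improvements (the patent's formula image is not reproduced in this text, so this exact formulation is an assumption):

```python
def reward(mae_static, mape_static, mae_action, mape_action, tol=0.01):
    """Average relative improvement of the action model over the static
    coarse-calibrated model; a degradation within `tol` (1%) on both
    indices is treated as 'not completely effective' and earns zero."""
    r = 0.5 * ((mae_static - mae_action) / mae_static
               + (mape_static - mape_action) / mape_static)
    within_tol = (mae_action <= mae_static * (1 + tol)
                  and mape_action <= mape_static * (1 + tol))
    if r < 0 and within_tol:
        return 0.0  # not completely effective
    return r        # positive if effective, negative if ineffective
```

This reproduces the sign behavior described above: positive when both indices improve, zero inside the 1% tolerance band, negative otherwise.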
2. Design the DDPG algorithm training flow.
In combination with the application scenario of short-term traffic state prediction, and referring to the classic DDPG algorithm, as shown in FIG. 2, the invention proposes the following training process:
(1) Initialize parameters.
The agent in DDPG adopts the Actor-Critic architecture, which integrates value-based and policy-based methods. The Actor outputs actions, interacts with the environment, and learns the policy; the Critic evaluates the actions and improves the policy. Both are implemented as neural networks. To improve the stability of the algorithm, the original Actor and Critic networks are treated as estimation networks (online networks), and a copy with the same structure, called the target network, is created for each. The estimation networks update their parameter values at every interaction with the environment, while the target networks copy the estimation network parameters at specified intervals.
Therefore, the estimation network parameters of the Critic and the Actor are initialized as θ^Q and θ^μ, respectively, and then copied to the target networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′.
(2) Collect experiences.
First, using the historical data and its time axis, a time point is randomly selected at which to start prediction.
Let the current prediction time step be i. As time advances, the traffic state V_i of that time step is recorded during prediction; the residual of the most recently evaluable prediction is computed as the prediction model state P_i; the parameter α used in the KNN prediction model is the action value a_i decided by the agent at that time step; the prediction effect is evaluated with the reward function to obtain r_i; and the state S_{i+1} is observed at the next time step, i.e. at time i+1. The tuple (S_i, a_i, r_i, S_{i+1}) is stored as one record in the memory pool for later use.
The prediction process is repeated continuously along the time axis until T predictions are completed, which is recorded as one training round. In this example, T = 100.
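Steps (2) and (3) rely on a memory pool of stored transitions; a minimal sketch (the capacity and API names are assumptions for illustration):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity pool of (S_i, a_i, r_i, S_{i+1}) records."""
    def __init__(self, capacity=10_000):
        self.pool = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation in the batch.
        return random.sample(self.pool, batch_size)

    def __len__(self):
        return len(self.pool)

memory = ReplayMemory()
for i in range(100):                       # one training round, T = 100
    memory.store(f"S{i}", 0.5, 0.0, f"S{i + 1}")
```

After one round the pool holds 100 records, from which random mini-batches can later be drawn for experience replay.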
(3) Perform experience replay.
As training progresses, more and more experience accumulates in the memory pool. A threshold δ is given; once the number of sample records in the pool exceeds δ, a batch is randomly sampled from the pool to train the estimation networks, whose parameters are updated by gradient computation and backpropagation. Denoting the gradient with respect to a network parameter θ by ∇_θ, the update formula is:
when the time step is appointed at the interval, the target network parameters are updated, the updating method is soft updating, and the method specifically comprises the following steps:
(4) Training ends when the specified number of rounds M is reached. In this example, M = 4000.
3. Deep neural network design in the DDPG algorithm.
The estimation networks of the Actor and the Critic, together with their target networks, form the set of deep neural networks that carry the policy learned by the DDPG algorithm. Since each target network is a copy of its estimation network, with an identical structure, only the estimation networks of the Actor and the Critic need to be designed, as follows. Although Automated Machine Learning (AutoML), and in particular Neural Architecture Search (NAS), has recently received extensive attention and offers the possibility of automatically designing deep neural networks, certain manual rules and constraints should still be imposed on the specific functions and structures of the networks. Considering the purpose and usage scenario of the Actor and Critic networks (their inputs and outputs, error functions, and so on), the invention imposes the following design requirements:
(1) The Actor network has two input interfaces, for the road network traffic state V_t and the prediction model state P_t. In addition to these two inputs, the Critic network also takes an action value as input, namely the output value of the Actor network.
(2) The output of the Actor network is an action value, namely the parameter α of the prediction model, so its output dimension is 1; given the value range of α, the output layer uses a sigmoid activation function. The output of the Critic network is a Q value, also of dimension 1; since its range is not bounded, the output layer uses a linear activation function.
(3) The Critic network updates its parameters by minimizing a loss function, i.e. by gradient descent. The loss function takes the form of a mean squared error, specifically

L(θ^Q) = (1/N_b) Σ_i (q_i − Q(s_i, a_i | θ^Q))²

where N_b is the number of records in the sampled batch.
where Q(s_i, a_i | θ^Q) is the output value of the Critic estimation network and q_i is the target value, computed from the state transition as
q_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)    (14)
The Actor network is updated based on the computation of the policy gradient, i.e.

∇_θ^μ J ≈ (1/N_b) Σ_i ∇_a Q(s, a | θ^Q) |_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ) |_{s=s_i}
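The target value of formula (14) and the mean squared loss can be written compactly; this sketch uses NumPy arrays in place of the network outputs (the networks themselves are abstracted away, which is an illustrative simplification):

```python
import numpy as np

def td_targets(rewards, q_next, gamma=0.99):
    """q_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), formula (14),
    where q_next stands for the target-network Q values."""
    return np.asarray(rewards, dtype=float) + gamma * np.asarray(q_next, dtype=float)

def critic_loss(q_pred, q_target):
    """Mean squared TD error minimised by the Critic's gradient descent."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_target = np.asarray(q_target, dtype=float)
    return float(np.mean((q_target - q_pred) ** 2))

targets = td_targets(rewards=[1.0, 0.0], q_next=[2.0, 4.0], gamma=0.5)
loss = critic_loss(q_pred=[1.0, 1.0], q_target=[1.0, 3.0])
```

Note that the targets are built from the target networks while the loss is taken against the estimation network's predictions, which is exactly why the slowly updated target copies stabilize training.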
Step four: Verification by prediction experiments.
It is crucial to verify that the training and the resulting model are effective. The mean absolute error MAE and the mean absolute percentage error MAPE form a relatively comprehensive evaluation index system, and the reward function is defined as the average percentage improvement of the DDPG-optimized model over the KNN prediction model on these two indices. Thus, during training, the effectiveness of the model can be monitored by observing single-prediction rewards or cumulative rewards per round; after training, the relative quality of the model can be verified directly by computing and comparing the two evaluation indices.
This example is illustrated with the monitored cumulative reward per round, as shown in FIG. 3. The black trend line shows that at the start of training the agent obtains low, even negative, rewards; after several rounds the rewards become positive. Although occasional rounds with negative values still appear, as training progresses the reward values become consistently positive, with the trend line settling at a value of about 100.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and all simple modifications, changes and equivalent structural changes made to the above embodiment according to the technical spirit of the present invention still fall within the protection scope of the technical solution of the present invention.
Claims (6)
1. A dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization, characterized by comprising the following steps:
Step one: data collection and processing
(1) Collect the time, position and instantaneous vehicle speed information uploaded to the supervising system by vehicle-mounted GPS devices within a specified time interval;
(2) Calculate the average vehicle speed v(t) on a road section l at time t, where the average values at and before time t are known;
(3) By the above method, obtain the set of average speed values V_t = (v_1(t), v_2(t), …, v_n(t)) corresponding to the road section network formed by sections l_1, l_2, …, l_n, where n is the number of road sections;
(4) Aggregate the data collected at time t and at the δ−1 preceding times into a space-time matrix X_t, which represents the traffic situation at time t, as shown in formula (1):

X_t = (V_{t−δ+1}, V_{t−δ+2}, …, V_t)    (1)
(5) Process X_t to compute, for each single reference state point, a trend vector used to generate the reference point, and define a traffic state unit X′_t in vector form, i.e.
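The aggregation in steps (4) and (5) can be sketched as follows; the trend vector here is taken as the change across the window, which is an assumption for illustration since the patent's formula image is not reproduced in this text:

```python
import numpy as np

def state_unit(history, t, delta):
    """Stack the last `delta` network snapshots into the space-time
    matrix X_t (n road sections x delta time steps) and derive a
    trend vector as the change across the window."""
    X_t = np.stack(history[t - delta + 1 : t + 1], axis=1)
    trend = X_t[:, -1] - X_t[:, 0]
    return X_t, trend

# Hypothetical average speeds for n = 2 road sections at three times.
history = [np.array([30.0, 42.0]), np.array([28.0, 42.0]), np.array([25.0, 44.0])]
X_t, trend = state_unit(history, t=2, delta=3)
```

With δ = 3 this yields a 2 × 3 matrix per state unit plus a length-2 trend vector, one entry per road section.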
Step two: construction of KNN-based static prediction model
(1) Measure the distance ED_i to a reference state point V_t with the Euclidean distance, and the distance CD_i to the trend vector with the cosine distance, with the following expressions:
where the i-th known reference state point and the i-th known trend vector are taken from the historical data in the collected samples, denoted by the subscript h in the formulas; on this basis, construct a state distance SD_i for measuring the similarity of state units:
Wherein u =1,2, …, M and M are historical sample numbers, α is a coefficient for balancing euclidean distance and cosine distance, and the value range is [0,1];
(2) Selecting K neighbors according to similarity measurement results
Calculate the state distances between the sample X_{t+1} to be predicted and all known historical samples, and take the K historical samples V_{h,1}, V_{h,2}, …, V_{h,K} with the smallest distances as neighbors;
(3) Calculating a prediction value of a sample to be predicted
Compute the predicted value by the method of increment prediction, Gaussian-weighting the label values of the neighbors according to their state distances SD_i.
For X_t, denote its future state point X_{t+1} at time (t+1) by y_t;
the increment is the difference between y_{h,j} and V_{h,j} of a neighbor; for the j-th neighbor (j = 1, 2, …, K) it is expressed as:

Δy_{h,j} = y_{h,j} − V_{h,j}    (6)

i.e. the traffic state change within the prediction window;
second, compute the predicted value at the future time (t+1) by Gaussian weighting as
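Steps (2) and (3) can be sketched end to end; the Gaussian kernel bandwidth `sigma` is an assumption, since the weighting formula's image is not reproduced in this text:

```python
import numpy as np

def knn_increment_predict(state_distances, increments, current_state, sigma=1.0):
    """Gaussian-weight each neighbour's observed change (formula (6))
    by its state distance and add the weighted change to the current
    network state to obtain the prediction for time t+1."""
    sd = np.asarray(state_distances, dtype=float)
    w = np.exp(-sd ** 2 / (2 * sigma ** 2))
    w = w / w.sum()
    weighted_change = (w[:, None] * np.asarray(increments, dtype=float)).sum(axis=0)
    return np.asarray(current_state, dtype=float) + weighted_change

# Two equally distant neighbours whose states rose by 2 and 4 km/h per section.
pred = knn_increment_predict([1.0, 1.0], [[2.0, 2.0], [4.0, 4.0]], [10.0, 10.0])
```

Predicting the increment rather than the absolute state lets the model track the current level of the network while borrowing only the direction and magnitude of change from similar historical situations.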
2. The dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization according to claim 1, wherein
the second step is as follows: in constructing the static prediction model based on the KNN, the method further comprises the following steps:
(4) Coarse calibration of model parameters
Calibrate the undetermined parameters δ, K and α in the KNN-based static prediction model by a grid search experiment on the collected real data, as follows:
first, establish an evaluation system for the prediction effect, comprising the mean absolute error MAE and the mean absolute percentage error MAPE:

MAE = (1/(nN)) Σ_{i=1}^{N} Σ_{l=1}^{n} |ŷ_{i,l} − y_{i,l}|,  MAPE = (1/(nN)) Σ_{i=1}^{N} Σ_{l=1}^{n} |ŷ_{i,l} − y_{i,l}| / y_{i,l} × 100%
where n is the number of road sections, N is the number of samples to be predicted in the experiment, and ŷ_{i,l} and y_{i,l} are the predicted value and the true value of the i-th sample to be predicted on road section l, respectively;
second, discretize the value ranges of the parameters to be calibrated, run experiments with each parameter combination in turn, and record the experimental results;
finally, select the parameter combination with the best experimental result as the calibrated values of the model parameters.
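The coarse calibration above is a plain exhaustive grid search; a minimal sketch, where the scoring function is a hypothetical stand-in for running the prediction experiment and computing MAE/MAPE:

```python
import itertools

def grid_search(score, grids):
    """Try every combination of the discretised parameter ranges and
    return the combination with the lowest error score."""
    names = list(grids)
    combos = (dict(zip(names, values))
              for values in itertools.product(*grids.values()))
    return min(combos, key=score)

# Hypothetical error surface standing in for the real prediction experiment.
best = grid_search(
    score=lambda p: abs(p["K"] - 10) + abs(p["alpha"] - 0.8),
    grids={"K": [5, 10, 15], "alpha": [0.2, 0.5, 0.8]},
)
```

The cost grows multiplicatively with each added parameter range, which is exactly why the patent reserves this exhaustive procedure for a one-off coarse calibration and optimizes α dynamically thereafter.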
3. The dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization according to claim 2, wherein
the method also comprises the following third step: the parameter alpha is dynamically optimized based on the DDPG algorithm. The method specifically comprises the following steps:
the following definitions are made:
state S_t: comprises the observed external road network traffic state V_t and the prediction model's own state P_t, i.e. S_t = {V_t, P_t}, where V_t is the state unit observed at time t and P_t is the residual of the most recent prediction known to the model at time t;
action a_t: the value of the parameter α selected in the decision; unlike the coarse calibration of the KNN model parameters, α takes continuous values in [0, 1] and is not discretized;
immediate reward r_t: the average index improvement rate obtained by executing action a_t relative to the coarsely calibrated static KNN prediction model is used as the reward function; if the indices obtained by taking action a_t are all smaller than those of the coarsely calibrated model, a_t is called an effective optimization; otherwise a_t is called an ineffective optimization; if an index obtained by taking action a_t exceeds the corresponding index of the coarsely calibrated model by no more than 1% of its value, a_t is called not completely effective; accordingly, the reward function r_t is defined as:
where MAE_t and MAE′_t are the mean absolute errors obtained for state unit X_t by the static coarsely calibrated model and by the model taking action a_t, respectively; similarly, MAPE_t and MAPE′_t are the corresponding mean absolute percentage errors. When a_t is effective, r_t is positive; when a_t is ineffective, r_t is negative; when a_t is not completely effective, r_t is 0.
4. The dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization according to claim 3, wherein
the DDPG algorithm training process comprises the following steps:
(1) Initialization parameters
DDPG adopts the Actor-Critic architecture: the Actor outputs actions, interacts with the environment and learns the policy, while the Critic evaluates the actions and improves the policy; both are implemented as neural networks;
the original Actor and Critic networks are treated as estimation networks (online networks), and a copy with the same structure, called the target network, is created for each; the estimation networks update their parameter values at every interaction with the environment, and the target networks copy the estimation network parameters at specified intervals;
first, initialize the estimation network parameters of the Critic and the Actor, denoted θ^Q and θ^μ respectively, where θ^Q is the Critic estimation network parameter and θ^μ is the Actor estimation network parameter, and copy them to the target networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′;
(2) Experience collection
Firstly, randomly selecting a time point to start prediction by using historical data and a time axis thereof;
denote the prediction time step by i, where i is an arbitrary sequence number; as time advances, record the traffic state V_i of that time step during prediction; compute the residual of the most recently evaluable prediction as the prediction model state P_i; the parameter α used in the KNN prediction model is the action value a_i decided by the agent at that time step; evaluate the prediction effect with the reward function to obtain r_i; observe the state S_{i+1} at the next time step, i.e. at time i+1; store (S_i, a_i, r_i, S_{i+1}) as one record in the memory pool for later use;
repeat the prediction process continuously along the time axis until T predictions are completed, recorded as one training round;
(3) Experience replay
given a threshold on the number of sample records in the memory pool, once the number of records exceeds the threshold, randomly sample a batch from the pool to train the estimation networks, updating the estimation network parameters by gradient computation and backpropagation; denoting the gradient with respect to a network parameter θ by ∇_θ, the update formula is:
at specified time-step intervals, the target network parameters are soft-updated, specifically:
(4) Training ends when the specified number of rounds is reached.
5. The dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization according to claim 4, wherein
the deep neural network design mode in the DDPG algorithm is as follows:
(1) The Actor network has two input interfaces, for the road network traffic state V_t and the prediction model state P_t; in addition to these two inputs, the Critic network also takes an action value as input, namely the output value of the Actor network;
(2) The output of the Actor network is an action value, namely the parameter α of the prediction model, with output dimension 1; given the value range of α, the output layer uses a sigmoid activation function; the output of the Critic network is a Q value, also of dimension 1, and since its range is not bounded, the output layer uses a linear activation function;
(3) The Critic network updates its parameters by minimizing a loss function, i.e. by gradient descent; the loss function takes the form of a mean squared error, specifically
where Q(s_i, a_i | θ^Q) is the output value of the Critic estimation network and q_i is the target value, computed from the state transition as
q_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)    (14)
where Q′ is the target Q network and μ′ is the target policy network;
the Actor network is updated based on the computation of the policy gradient, i.e.
6. The dynamic short-time road network traffic state prediction method based on deep deterministic policy gradient algorithm optimization according to claim 5, further comprising:
step four: verification of prediction experiment
monitoring the effectiveness of the model during training by observing single-prediction rewards or cumulative rewards per round; and after training, directly verifying the relative quality of the model by computing and comparing the two evaluation indices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111115375.9A CN115938104B (en) | 2021-09-23 | Dynamic short-time road network traffic state prediction model and prediction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115938104A true CN115938104A (en) | 2023-04-07 |
CN115938104B CN115938104B (en) | 2024-06-28 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060223529A1 (en) * | 2005-03-31 | 2006-10-05 | Takayoshi Yokota | Data processing apparatus for probe traffic information and data processing system and method for probe traffic information |
JP2010092247A (en) * | 2008-10-07 | 2010-04-22 | Internatl Business Mach Corp <Ibm> | Controller, control method and control program |
CN109190797A (en) * | 2018-08-03 | 2019-01-11 | 北京航空航天大学 | A kind of large-scale road network state Forecasting Approach for Short-term based on improvement k arest neighbors |
CA3060900A1 (en) * | 2018-11-05 | 2020-05-05 | Royal Bank Of Canada | System and method for deep reinforcement learning |
CN112907971A (en) * | 2021-02-04 | 2021-06-04 | 南通大学 | Urban road network short-term traffic flow prediction method based on genetic algorithm optimization space-time residual error model |
CN113313947A (en) * | 2021-05-31 | 2021-08-27 | 湖南大学 | Road condition evaluation method of short-term traffic prediction graph convolution network |
Non-Patent Citations (1)
Title |
---|
LIU Yishi; GUAN Xuefeng; WU Huayi; CAO Jun; ZHANG Na: "Short-term traffic speed prediction with LSTM weighted by spatio-temporal correlation", Geomatics World, no. 01, 25 February 2020 (2020-02-25), pages 49-55 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116362418A (en) * | 2023-05-29 | 2023-06-30 | 天能电池集团股份有限公司 | Online prediction method for application-level manufacturing capacity of intelligent factory of high-end battery |
CN116362418B (en) * | 2023-05-29 | 2023-08-22 | 天能电池集团股份有限公司 | Online prediction method for application-level manufacturing capacity of intelligent factory of high-end battery |
CN117974366A (en) * | 2024-04-01 | 2024-05-03 | 深圳市普裕时代新能源科技有限公司 | Energy management system based on industrial and commercial energy storage |
CN117974366B (en) * | 2024-04-01 | 2024-06-11 | 深圳市普裕时代新能源科技有限公司 | Energy management system based on industrial and commercial energy storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |