CN111598311B - Novel intelligent optimization method for train running speed curve

Info

Publication number: CN111598311B
Application number: CN202010349688.XA
Authority: CN (China)
Prior art keywords: train, train operation, data, state, reward
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111598311A
Inventors: 董海荣, 周学影, 周敏, 宋海锋, 袁磊
Current Assignee: Beijing Jiaotong University
Original Assignee: Beijing Jiaotong University
Application filed by Beijing Jiaotong University; priority to CN202010349688.XA

Classifications

    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06Q 50/26: Government or public services


Abstract

The invention discloses a novel intelligent optimization method for a train running speed curve, which comprises the following steps: step one, building a train operation reinforcement learning environment; step two, establishing a reward mechanism; step three, updating a train operation historical information database; and step four, letting the intelligent agent interact with the train operation reinforcement learning environment. The invention performs multi-objective optimization of the train running speed curve, with optimization objectives including train punctuality, energy consumption and comfort; a certain degree of energy saving is achieved while keeping the train as punctual as possible, and passenger riding comfort can be improved.

Description

Novel intelligent optimization method for train running speed curve
Technical Field
The invention belongs to the technical field of train operation optimization, and particularly relates to a novel intelligent optimization method for a train operation speed curve.
Background
The railway system is characterized by sharply changing line environments, complex external influence factors, frequent cross-line and long-distance operation, and complex and varied infrastructure and train characteristics, so the train operation process is a nonlinear problem constrained by many factors such as line conditions and speed limits. Because different running speed curves greatly influence train energy consumption, safety, punctuality and other aspects, existing train running speed curve optimization algorithms can hardly meet the requirements of quickly restoring operation and ensuring efficient operation under complex line conditions, and cannot meet the real-time requirements of train operation optimization control in complex environments. In the existing train operation control system, when an emergency occurs and the train running time needs to be adjusted, the ATO system (automatic train operation system) cannot automatically adjust the speed curve to ensure that the train arrives on time; instead, the train must be switched to a manual driving mode. With the development of computing power and artificial intelligence in recent years, the search for efficient intelligent optimization methods has become a research hotspot. Therefore, how to combine a novel intelligent method to perform rapid, real-time optimization of the train speed curve and improve train operation indexes such as punctuality, comfort and energy saving remains a problem worthy of further study.
The train speed curve optimization problem is a multi-stage decision problem. Model-free reinforcement learning methods have shown superiority and speed in searching for approximate optimal solutions of multi-stage decision problems; reinforcement learning analyzes data, learns online by itself, gains insight from environmental feedback, and can learn how to achieve set targets in complex and uncertain environments. Therefore, combining the reinforcement learning method has important theoretical and practical significance for speed curve optimization.
Disclosure of Invention
The invention provides a novel intelligent optimization method for a train running speed curve, which aims to improve three performance indexes of the train, namely punctuality, comfort and energy saving.
The technical scheme of the invention is as follows:
a novel intelligent optimization method for a train running speed curve comprises the following steps:
step one, building a train operation reinforcement learning environment:
establishing a train operation reinforcement learning environment in the first step according to the line static data, the train static data and the train operation dynamic data; the train running dynamic data comprises the current running position, the speed, the acceleration and the train running time of the train;
step two, establishing a reward mechanism:
through the reward function, the vehicle-mounted controller determines reward values corresponding to different working condition actions in each state; the reward function is set as a relevant function of train operation time, energy consumption and acceleration, and the reward value is provided by the train operation reinforcement learning environment in the step one;
step three, updating the train operation historical information database:
collecting train operation data information in a real railway scene, wherein the data information comprises the position, the speed, the acceleration value, the time, the line gradient and the line speed limit value of train operation to form a train operation state data set; aiming at the train running state data set, finding out a state data set with the maximum similarity through the Manhattan distance; combining the train running state data set and the corresponding action to form a train state-action data set, namely a train running track; the train operation historical information database is formed by a plurality of train operation tracks; processing the data in the train operation history information database according to the reward mechanism in the step two, namely calculating reward for each state-action pair in each operation track to obtain an updated train operation history information database for training neural network parameters;
step four, interaction between the intelligent agent and the train operation reinforcement learning environment:
the train operation reinforcement learning environment generates a new state, a new reward value and a new state value function and feeds the new state, the new reward value and the new state value function back to the intelligent agent; data obtained after real train operation historical data processing are stored in an experience playback data area; the intelligent agent continuously carries out strategy evaluation and strategy improvement through the action value function, selects the maximum action value function, feeds back the action corresponding to the maximum action value function to the train operation reinforcement learning environment, continuously updates the train working condition value through a closed loop structure, and finally selects the optimal working condition action to generate an optimal train speed curve;
and in the fourth step, the intelligent agent is equivalent to the vehicle-mounted controller in the second step.
The invention has the beneficial effects that:
firstly, a reinforcement learning method is adopted, and the neural network parameters are trained using historical train operation data; on the one hand, the optimization of the speed curve does not depend on a specific train model, which avoids the adverse effect of a complex and changeable train operation environment on the solving process; on the other hand, the method learns from real historical data, calculates the reward function and improves the learning model, which improves the solving speed and quality of the approximate optimal solution of the speed curve.
Secondly, the invention carries out multi-objective optimization aiming at the train running speed curve, the optimization objectives comprise train punctuality, energy consumption and comfort, certain energy saving of the train is ensured under the condition of as accurate point as possible, and the passenger riding comfort is improved.
Drawings
Fig. 1 is a graph of the shortest operation time of a novel intelligent optimization method for a train operation speed curve according to an embodiment of the present invention;
fig. 2 is a schematic diagram of agent-environment interaction of a novel intelligent optimization method for a train running speed curve according to an embodiment of the present invention;
fig. 3 is a structural diagram of a novel intelligent optimization method for a train operation speed curve according to an embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the drawings and the embodiments so that those skilled in the art can better understand and carry out the present invention; the embodiments of the present invention are, however, not limited thereto.
In the first step, a train operation reinforcement learning environment is established; the data of the reinforcement learning environment comprise static data of the line and the train and dynamic data of train operation. The train operation dynamic data comprise the current running position, speed, acceleration and running time of the train. Using these data, constraints on the train running speed solution are given and the solution space is reduced. As shown in fig. 1, a curve of the shortest operation time of the train under maximum traction, maximum braking and section speed limit constraints is given; this curve gives the maximum speed the train can reach and is used as a constraint condition when data sampling is performed using similarity in step three, so that a solution space consistent with actual train operation is obtained.
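The construction of the shortest-operation-time curve of fig. 1 is not detailed above; one standard way to build such an envelope is a forward pass under maximum traction capped by the section speed limits, followed by a backward pass under maximum braking. A minimal Python sketch follows (all function names and parameter values are illustrative assumptions, not part of the method described above):

import numpy as np

def min_time_speed_envelope(speed_limits, delta_d=10.0, a_max=1.0, b_max=1.0):
    """Shortest-running-time speed envelope for a line discretized in steps of delta_d metres.

    speed_limits : per-step section speed limit in m/s
    a_max, b_max : assumed constant maximum traction / braking accelerations in m/s^2"""
    n = len(speed_limits)
    v = np.zeros(n)
    for i in range(1, n):                                  # forward pass: maximum traction, capped by the limit
        v[i] = min(np.sqrt(v[i - 1] ** 2 + 2.0 * a_max * delta_d), speed_limits[i])
    v[-1] = 0.0                                            # the train must stop at the end of the section
    for i in range(n - 2, -1, -1):                         # backward pass: remain able to brake in time
        v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2.0 * b_max * delta_d))
    return v                                               # maximum admissible speed at each position step

# Example: 2 km section with an 80 km/h limit and a 40 km/h restriction in the middle.
limits = np.full(200, 80 / 3.6)
limits[90:110] = 40 / 3.6
envelope = min_time_speed_envelope(limits)

The resulting envelope can then be used, as described above, to reject sampled states in step three whose speed exceeds the admissible maximum at the corresponding position.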
In the second step, the invention aims to improve the train operation punctuality rate, reduce the train operation energy consumption and improve the riding comfort of passengers. The reward function embodies these optimization goals and is therefore set as a correlation function of train operation time, energy consumption and acceleration. It can be expressed as:

r_i = r_i^time + r_i^energy + r_i^comfort

wherein the reward function related to punctuality is designed as follows:

[formula image: per-step punctuality reward r_i^time, defined in terms of the step time t_i and the maximum step time t_max]

To ensure punctuality, at the end of each trial an additional term is added to the punctuality reward, namely:

[formula image: terminal punctuality term, defined in terms of the actual inter-station running time T_r and the planned running time T]

The reward function related to energy consumption is designed as follows:

[formula image: per-step energy reward r_i^energy, defined in terms of the train acceleration u_i, the position step length Δd and the maximum per-step energy e_max]

The comfort-related reward function is designed as follows:

[formula image: per-step comfort reward r_i^comfort, defined in terms of the change of acceleration and the maximum impact rate Δc_max]

where t_i and t_max are respectively the actual time and the maximum time spent by the train to run one position step; T_r and T are respectively the actual running time and the planned running time between stations; u_i and Δd are respectively the train acceleration and the train position step length; N is the total number of steps; e_max is the maximum energy consumed in one step; and Δc_max is the maximum impact rate of train operation.
If different performance indicators have different requirements, a weight may be set for each performance indicator, as follows:

r_i = w_1*r_i^time + w_2*r_i^energy + w_3*r_i^comfort, with w_1 + w_2 + w_3 = 1

where w_1, w_2 and w_3 are respectively the weights of the time, energy-consumption and comfort terms.
After the reward mechanism is established, the reward value obtained by the train for taking different actions in each state can be computed. As shown in fig. 2, the acceleration, the time and the maximum time spent per unit distance of train operation, the maximum energy consumption and the maximum impact rate appearing in the reward function are obtained from step one; in other words, the reward value is provided by the train operation reinforcement learning environment.
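Because the individual reward formulas above are only available as images, the following Python sketch merely illustrates the structure described in the text: per-step punctuality, energy and comfort terms normalized by t_max, e_max and Δc_max, combined with the weights w_1, w_2, w_3, plus a terminal punctuality term. Every concrete expression in the sketch is an assumption rather than the formula of the original document:

def step_reward(t_i, e_i, delta_u_i, t_max, e_max, dc_max,
                w_time=0.4, w_energy=0.4, w_comfort=0.2):
    """Per-step reward: weighted sum of punctuality, energy and comfort terms.

    The concrete penalty expressions below are illustrative assumptions; the original
    document defines them by formulas that are not reproduced here."""
    r_time = -t_i / t_max                 # longer step time -> lower reward
    r_energy = -e_i / e_max               # more traction energy -> lower reward
    r_comfort = -abs(delta_u_i) / dc_max  # larger change of acceleration -> lower reward
    assert abs(w_time + w_energy + w_comfort - 1.0) < 1e-9   # w_1 + w_2 + w_3 = 1
    return w_time * r_time + w_energy * r_energy + w_comfort * r_comfort

def terminal_time_bonus(actual_trip_time, planned_trip_time):
    """Additional punctuality term added at the end of each trial (form assumed)."""
    return -abs(actual_trip_time - planned_trip_time) / planned_trip_time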
In the third step, the collected train operation data information includes the train operation position, speed, acceleration/deceleration value and time. Each train operation state is set as follows:

s_i = (d_i, v_i, t_i, u_i, g_i, v_i^lim)

where d_i, v_i, t_i, u_i, g_i and v_i^lim are respectively the position, the speed, the time, the acceleration value, the line gradient and the line speed limit value of the train in the current state i.
The invention needs to generate data for reinforcement learning training from this historical data information; that is, for the current train running state, the state set with the maximum similarity is found from the data sets, and the similarity between data can be measured with the Manhattan distance, namely:

D(s_i, s_k) = Σ_j | s_i(j) - s_k(j) |

where s_i(j) and s_k(j) are respectively the j-th elements of s_i and s_k.

For each state s_i, this method is adopted to find the n nearest states {s_k1, s_k2, ..., s_kn}; accordingly, the state-action pairs (s_k, a_k) corresponding to these n states can be obtained, where s_k and a_k respectively denote the approximate states found for state s_i and the corresponding actions; the actions correspond to accelerations, and different accelerations correspond to different actions. It is noted that n is not fixed and is determined according to the constraints provided in step one.
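A minimal Python sketch of this nearest-state lookup under the Manhattan distance follows; the feasibility check against the step-one envelope is represented here by a simple speed comparison, which is an assumption, since the text only states that n is determined by the step-one constraints:

import numpy as np

def nearest_states(query_state, history_states, history_actions, envelope_speed, max_n=20):
    """Return up to max_n (state, action) pairs closest to query_state under the
    Manhattan (L1) distance.

    history_states  : (M, 6) array of states (d, v, t, u, gradient, speed limit)
    history_actions : (M,) array of the accelerations taken in those states
    envelope_speed  : maximum admissible speed at the query position, from step one"""
    dist = np.abs(history_states - query_state).sum(axis=1)   # Manhattan distance to every stored state
    picked = []
    for k in np.argsort(dist):
        if history_states[k, 1] <= envelope_speed:            # keep only samples feasible under the envelope
            picked.append((history_states[k], history_actions[k]))
        if len(picked) == max_n:
            break
    return picked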
The train trajectory can be described as a set of state-action pairs as follows:

τ = {s_0, a_0, s_1, a_1, ..., s_{N-1}, a_{N-1}, s_N}

A plurality of such train running tracks form the train operation history information database, namely:

M = {τ_1, τ_2, ..., τ_M}

The historical information database in this step contains a large number of train running tracks. The reward mechanism provided in step two is adopted to calculate the reward of each state-action pair in each running track, so that the following data set is obtained:

τ' = {s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{N-1}, a_{N-1}, r_{N-1}, s_N}

That is, the train history information database is updated to M' = {τ'_1, τ'_2, ..., τ'_M}. Compared with the original database, the updated database additionally contains the reward value of each state-action pair, and it is used for training the neural network in step four.
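The conversion of recorded trajectories into the reward-augmented database M' could be organized as sketched below; the per-step time, energy and jerk proxies reuse the step_reward helper sketched under step two and are illustrative assumptions:

def augment_trajectory(states, actions, t_max, e_max, dc_max):
    """Turn one raw trajectory {s_0, a_0, ..., s_N} into transitions (s, a, r, s', done)."""
    transitions = []
    for i in range(len(actions)):
        s, s_next = states[i], states[i + 1]
        t_i = s_next[2] - s[2]                            # time spent on this position step
        e_i = max(actions[i], 0.0) * (s_next[0] - s[0])   # crude traction-energy proxy (assumption)
        delta_u = s_next[3] - s[3]                        # change of acceleration (jerk proxy)
        r = step_reward(t_i, e_i, delta_u, t_max, e_max, dc_max)
        transitions.append((s, actions[i], r, s_next, i == len(actions) - 1))
    return transitions

def build_replay_database(trajectories, t_max, e_max, dc_max):
    """M' = {tau'_1, ..., tau'_M}: apply the reward mechanism to every recorded run."""
    return [augment_trajectory(s, a, t_max, e_max, dc_max) for s, a in trajectories]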
Step four is the core part of the invention. In this part, the intelligent agent and the reinforcement learning environment continuously interact and learn, and the optimal working condition action is selected and fed back to the train operation reinforcement learning environment through evaluation and improvement of the strategy value function.

As shown in fig. 2, the intelligent agent is equivalent to the vehicle-mounted controller. During train operation, the intelligent agent interacts with the environment: the environment generates a new state, a reward value and a state value function and feeds them back to the intelligent agent; the intelligent agent continuously performs strategy evaluation and strategy improvement through the value function, selects the maximum action value function, and feeds the action corresponding to the maximum action value function back to the train operation reinforcement learning environment. The working condition value is continuously updated through this closed-loop structure, the optimal working condition action is finally selected, and an optimal train speed curve is generated, achieving the goals of energy saving and comfort.
The intelligent agent action value function is updated in the Deep Q-Network (DQN) manner, and the update is performed by gradient descent on the squared temporal-difference error:

( r + γ max_{a'} Q(s', a'; θ^-) - Q(s, a; θ) )^2

where θ^- and θ are respectively the network parameters of the target network and of the value function approximation; s and s' are the current state and the next state; a and a' are respectively the actions selected in the current state and the next state, corresponding to accelerations; r is the reward value; γ is the discount factor; and Q denotes the action value function.
The basic procedure for optimizing the speed curve using DQN is as follows:

Input: states s ∈ S, train actions a ∈ A, value function Q establishing the mapping S × A → R
Initialize the experience replay data area D with capacity N
Initialize the state-action value function Q with random weights θ
Let θ^- = θ, initializing the target network Q̂
Begin:
For each training episode, episode = 1, 2, ...:
    Obtain the initial state s_1 = (d_1, v_1, u_1, t_1) (the initial state is the zero vector)
    For t = 1, 2, ...:
        Select N transitions (s_i, a_i, r_i, s_{i+1}) from the historical information database and store them in D
        Sample m training samples (s_j, a_j, r_j, s_{j+1}) from D
        Compute the target y_j = r_j if the episode terminates at step j+1, otherwise y_j = r_j + γ max_{a'} Q̂(s_{j+1}, a'; θ^-)
        Perform a gradient descent step on (y_j - Q(s_j, a_j; θ))^2
        Every C steps, update the target network weights θ^- ← θ
    End of the inner loop over steps
End of the loop over episodes
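A minimal PyTorch sketch of this procedure follows: transitions taken from the historical information database fill the experience replay area D, mini-batches are sampled from D, the target y_j is computed with the target-network parameters θ^-, and θ^- is copied from θ every C updates. The network size, the discretization of accelerations into working-condition indices and all hyper-parameter values are assumptions:

import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 6, 5      # state (d, v, t, u, gradient, speed limit); 5 assumed working conditions
GAMMA, BATCH, C = 0.99, 64, 200  # assumed discount factor, batch size and target-update period

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())            # theta_minus = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                             # experience replay data area D

def train(transitions, episodes=100):
    """transitions: (s, a, r, s_next, done) tuples built from the historical database,
    with a given as a discrete working-condition index (an assumption of this sketch)."""
    step = 0
    for _ in range(episodes):
        replay.extend(transitions)                         # D is filled from historical data, not from a simulator
        for _ in range(max(1, len(replay) // BATCH)):
            batch = random.sample(replay, BATCH)
            s, a, r, s_next, done = map(torch.tensor, zip(*[
                (t[0], t[1], t[2], t[3], float(t[4])) for t in batch]))
            s, s_next = s.float(), s_next.float()
            with torch.no_grad():                          # y_j from the target network
                y = r.float() + GAMMA * (1.0 - done.float()) * target_net(s_next).max(dim=1).values
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, y)            # (y_j - Q(s_j, a_j; theta))^2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % C == 0:                              # theta_minus <- theta every C steps
                target_net.load_state_dict(q_net.state_dict())

In this sketch the environment loop of classical DQN is replaced by reading transitions from the historical database, matching the offline-training, online-optimization scheme described below.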
Through the above steps, the trained neural network parameters used for approximating the value function are finally obtained. Using these parameters, the train running state and the relevant line conditions can be input during train operation, and an optimized speed curve is obtained.
It can be seen from the above that, unlike the conventional DQN algorithm, the experience replay data area in the DQN algorithm of the present invention stores data obtained by processing actual train operation history data, as shown in fig. 3, rather than experience data generated by the reinforcement learning environment. This data can be obtained from the onboard computer of the train. With this processing mode, on the one hand, the method does not depend on a specific train dynamics model, which avoids the adverse effects of a complex train operation environment on modelling and solving; on the other hand, the method learns from real historical data, calculates the reward function and improves the learning model, which improves the solving speed and quality of the approximate optimal solution of the speed curve. The neural network trained with the historical data can output the optimal action according to the current state, that is, output the optimal train operation condition according to the current train operation state, so that punctual, energy-saving and comfortable operation is achieved through offline training and online optimization.
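With the trained parameters, online optimization reduces to evaluating the value network in the current state and applying the working condition with the largest action value, as in the following sketch (reusing q_net from the training sketch above):

def optimal_condition(state):
    """Pick the working-condition index with the largest action value for the given state."""
    with torch.no_grad():
        q_values = q_net(torch.tensor(state, dtype=torch.float32))
    return int(q_values.argmax())

# During a run: feed the current (d, v, t, u, gradient, speed limit) state, apply the
# returned working condition, and repeat step by step to trace out the optimized speed curve.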
The above description is only a few examples of the present invention, and is not intended to limit the present invention. All the modifications and improvements made to the above examples according to the technical essence of the present invention fall within the scope of the present invention.

Claims (1)

1. A novel intelligent optimization method for a train running speed curve comprises the following steps:
step one, building a train operation reinforcement learning environment:
establishing a train operation reinforcement learning environment in the first step according to the line static data, the train static data and the train operation dynamic data; the train running dynamic data comprises the current running position, the speed, the acceleration and the train running time of the train;
step two, establishing a reward mechanism:
determining a reward value corresponding to different working condition actions adopted in each state by the vehicle-mounted controller through a reward function; the reward function is set as a relevant function of train operation time, energy consumption and acceleration, and the reward value is provided by the train operation reinforcement learning environment in the step one;
step three, updating the train operation historical information database:
collecting train operation data information in a real railway scene, wherein the data information comprises the position, the speed, the acceleration value, the time, the line gradient and the line speed limit value of train operation to form a train operation state data set; aiming at the train running state data set, finding out a state data set with the maximum similarity through the Manhattan distance; combining the train running state data set and the corresponding action to form a train state-action data set, namely a train running track; the train operation historical information database is formed by a plurality of train operation tracks; processing the data in the train operation history information database according to the reward mechanism in the step two, namely calculating reward for each state-action pair in each operation track to obtain an updated train operation history information database for training neural network parameters;
step four, interaction between the intelligent agent and the train operation reinforcement learning environment:
the train operation reinforcement learning environment generates a new state, a new reward value and a new state value function and feeds the new state, the new reward value and the new state value function back to the intelligent agent; data obtained after real train operation historical data processing are stored in an experience playback data area; the intelligent agent continuously carries out strategy evaluation and strategy improvement through the action value function, selects the maximum action value function, feeds back the action corresponding to the maximum action value function to the train operation reinforcement learning environment, continuously updates the train working condition value through a closed loop structure, and finally selects the optimal working condition action to generate an optimal train speed curve;
and in the fourth step, the intelligent agent is equivalent to the vehicle-mounted controller in the second step.
CN202010349688.XA 2020-04-28 2020-04-28 Novel intelligent optimization method for train running speed curve Active CN111598311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349688.XA CN111598311B (en) 2020-04-28 2020-04-28 Novel intelligent optimization method for train running speed curve

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010349688.XA CN111598311B (en) 2020-04-28 2020-04-28 Novel intelligent optimization method for train running speed curve

Publications (2)

Publication Number Publication Date
CN111598311A CN111598311A (en) 2020-08-28
CN111598311B true CN111598311B (en) 2022-11-25

Family

ID=72182289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349688.XA Active CN111598311B (en) 2020-04-28 2020-04-28 Novel intelligent optimization method for train running speed curve

Country Status (1)

Country Link
CN (1) CN111598311B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231870B (en) * 2020-09-23 2022-08-02 西南交通大学 Intelligent generation method for railway line in complex mountain area
CN118194710A (en) * 2024-03-20 2024-06-14 华东交通大学 Multi-objective optimization method and system for magnetic levitation train

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109703606A (en) * 2019-01-16 2019-05-03 北京交通大学 Bullet train intelligent driving control method based on history data
CN109978350A (en) * 2019-03-13 2019-07-05 北京工业大学 A kind of subway train energy conservation optimizing method based on regime decomposition dynamic programming algorithm
CN110562301A (en) * 2019-08-16 2019-12-13 北京交通大学 Subway train energy-saving driving curve calculation method based on Q learning
CN110497943A (en) * 2019-09-03 2019-11-26 西南交通大学 A kind of municipal rail train energy-saving run strategy method for on-line optimization based on intensified learning

Also Published As

Publication number Publication date
CN111598311A (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant