CN115496201A - Train accurate parking control method based on deep reinforcement learning - Google Patents

Train accurate parking control method based on deep reinforcement learning

Info

Publication number
CN115496201A
CN115496201A (application number CN202211084513.6A)
Authority
CN
China
Prior art keywords
network
train
strategy
stop
deep
Prior art date
Legal status
Pending
Application number
CN202211084513.6A
Other languages
Chinese (zh)
Inventor
张磊 (Zhang Lei)
张建国 (Zhang Jianguo)
Current Assignee
Thales Sec Transportation System Ltd
Original Assignee
Thales Sec Transportation System Ltd
Priority date
Filing date
Publication date
Application filed by Thales Sec Transportation System Ltd
Priority to CN202211084513.6A
Publication of CN115496201A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B61 RAILWAYS
    • B61L GUIDING RAILWAY TRAFFIC; ENSURING THE SAFETY OF RAILWAY TRAFFIC
    • B61L27/00 Central railway traffic control systems; Trackside control; Communication systems specially adapted therefor
    • B61L27/60 Testing or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry


Abstract

The application relates to a train precise stopping control method based on deep reinforcement learning. The method comprises: collecting ATO station-stop data and preprocessing it to obtain an expert data set X; performing behavior-cloning imitation learning based on the expert data set X to initialize the policy network of the train; training the policy network online with a deep deterministic policy gradient method to obtain a deep optimization network; and outputting and storing the deep optimization network and using it for precise train stopping control. The following technical effects can be achieved: the ATO historical data of existing lines is fully exploited and imitation learning is performed offline; the coupling between the PI control parameters and the vehicle characteristics is removed by relying on the generalization capability of the neural network; the real-train online debugging time of the ATO software is reduced and the stopping precision is improved; and the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.

Description

Train accurate parking control method based on deep reinforcement learning
Technical Field
The disclosure relates to the technical field of rail transit, and in particular to a train precise stopping control method, device and system based on deep reinforcement learning.
Background
Metro systems play an important role in urban public transport; their degree of automation keeps rising, and the share of fully automatic driverless lines is steadily increasing. On a driverless line no driver is on board, so the requirements on the stopping precision achieved by ATO (Automatic Train Operation) vehicle control are higher. To make trains stop precisely, optimization of the train control algorithm has long been a research focus of ATO software.
A train control algorithm is generally tested on a train simulation model first, and only after the expected effect is reached is it tested on a real train. A train is a nonlinear dynamic system built from complex technical equipment, running in a complex environment and exhibiting complex spatio-temporal characteristics. In the final stopping phase the train braking switches from electric braking to air braking, and the air brake is a system with large hysteresis, nonlinearity and a degree of randomness.
With classical control methods, a target curve is typically designed and a tracking controller is designed to achieve automatic driving. The tracking controller usually employs a classical PI control algorithm to track the target curve. The PI control algorithm performs well during traction or electric braking; in the low-speed air-braking phase, however, because of the large delay, nonlinearity and randomness of the air brake, the actual train speed cannot follow the target curve as expected and the spread of the stopping precision becomes large.
The conventional approach to train simulation is to simplify the train into a linear dynamic system described by dynamic equations, which is sufficient for general control-algorithm research. Because the simulation model is heavily simplified, it can hardly reproduce the real behaviour of the brake system during the final stop, so the stopping control algorithm cannot be validated effectively on the simulation model. After the stopping control algorithm passes the simulation test, a large amount of real-train debugging is still needed to refine the algorithm and tune its parameters. Real-train debugging requires cooperation across several disciplines, the workload is large, and this is unfriendly to algorithm improvement. In addition, the effect of existing stopping control algorithms depends on the vehicle characteristics and line conditions, so parameters tuned for an existing line generally have to be re-tuned when the algorithm is transferred to a new line.
The control process of an ordinary train comprises four phases: starting acceleration, cruising, coasting, and braking to a stop. The stopping precision depends only on the braking phase; how the train is controlled before braking has essentially no influence on it. During high-speed braking the electric brake acts and the air brake does not; during low-speed braking the electric brake no longer acts and the air brake takes over. The electric brake responds quickly and consistently and can follow the commands of the PI controller well; the air brake responds slowly and with considerable randomness, so the PI controller cannot control it well and the stopping precision deviates significantly.
Disclosure of Invention
To solve the above problems, the present application provides a train precise stopping control method, device and system based on deep reinforcement learning.
In one aspect, the present application provides a train precise stopping control method based on deep reinforcement learning, comprising the following steps:
collecting ATO station-stop data and preprocessing it to obtain an expert data set X;
performing behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train;
training the policy network online with a deep deterministic policy gradient (DDPG) method to obtain a deep optimization network;
and outputting and storing the deep optimization network, and using the deep optimization network for precise train stopping control.
As an optional embodiment of the present application, collecting the ATO station-stop data and preprocessing it to obtain the expert data set X includes:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
As an optional embodiment of the present application, performing the behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train includes:
defining a loss function and using it to compute the error between the policy network output and the actual expert action in the expert data set X, the loss function being:
L(s, a; θ) ≜ (1/2)·[π(s; θ) - a]²
where the loss function L(s, a; θ) is a single-sample error that describes the difference between the output produced by the policy network with parameters θ for the input state s and the expert action a recorded for state s in the expert database; the symbol ≜ means "defined as", i.e. the left-hand expression is defined by the right-hand expression; [π(s; θ) - a]² is the squared error between the policy network output π(s; θ) and the actual expert action a, and the factor 1/2 cancels the constant 2 that appears when the gradient is taken for gradient descent;
defining a cost function that sums the loss over all sample points of each complete trajectory in the expert database X and accumulates the losses of all trajectories to obtain the total cost function J(θ):
J(θ) = Σ_i Σ_t L(s_t^(i), a_t^(i); θ)
where i runs over the trajectories and t over the sample points of trajectory i;
updating the neural network parameters θ with the following gradient-descent rule until the algorithm converges:
θ_new = θ_now - β·∇_θ J(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
∇_θ J(θ)|θ=θ_now: the gradient of the cost function J(θ) evaluated at θ_now;
and initializing the policy network with the neural network parameters θ obtained in this way.
As an optional embodiment of the present application, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network includes:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of the train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs a value q(s, a; w);
and the deep optimization network is optimized and updated so that its policy network and value network keep learning as the train characteristics change.
As an optional embodiment of the present application, the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture.
As an optional embodiment of the present application, the value network is updated per round using Monte Carlo returns, and the policy network is updated step by step in a temporal-difference manner.
As an optional implementation of the present application, the method for updating the value network of the deep optimization network includes:
in the final stage of the station-stop braking, controlling the train stop with a policy network π(s; θ) whose parameters are held fixed; the complete trajectory of the train motion controlled by the policy network has T steps, and the instant reward of every step before the train stops is 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range, the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range, the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
As an optional implementation of the present application, the method for updating the policy network of the deep optimization network includes:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
In another aspect, the application provides a device for implementing the above train precise stopping control method based on deep reinforcement learning, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
In another aspect of the present application, a control system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for controlling train accurate parking based on deep reinforcement learning.
The technical effects of the invention are as follows:
According to the method, the expert data set X is obtained by collecting and preprocessing ATO station-stop data; behavior-cloning imitation learning is performed based on the expert data set X to initialize the policy network of the train; the policy network is trained online with the deep deterministic policy gradient method to obtain the deep optimization network; and the deep optimization network is output, stored, and used for precise train stopping control. The following technical effects can be achieved:
the ATO historical data of existing lines is fully exploited and imitation learning is performed offline; the coupling between the PI control parameters and the vehicle characteristics is removed by relying on the generalization capability of the neural network;
the real-train online debugging time of the ATO software is reduced and the stopping precision is improved; the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of an implementation of the train precise parking control method based on deep reinforcement learning of the invention;
FIG. 2 is a control schematic of the on-board controller and the control interface of the train of the present invention;
FIG. 3 is a schematic diagram of the reinforcement learning framework of the present invention;
FIG. 4 is a schematic diagram of the architecture of the policy network of the train of the present invention;
FIG. 5 is a schematic diagram of the deep optimization architecture of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Example 1
First, the control principle of the train needs to be made clear. Fig. 2 is a schematic diagram of the control interface between the vehicle-mounted controller and the train. The vehicle-mounted controller outputs a traction command and a braking command to the train to request traction or braking; it outputs an 8-bit digital quantity to a D/A converter, which converts it into a 4-20 mA current signal and sends it to the train. The vehicle-mounted controller fuses the information from the speed sensor, the acceleration sensor, the positioning antenna and other sources, calculates the train position in real time, and obtains gradient and speed-limit information from the line database according to that position. The speed, acceleration and other state information calculated in real time from the sensors serve as the input of the ATO module and are used for real-time control of the train.
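As a minimal illustration of this interface, the sketch below maps a signed command level in [-255, 255] (the action range used later in this disclosure) to a traction/brake/coast request plus the current an 8-bit D/A converter would output; the function name and the assumption of a linear 0-255 to 4-20 mA mapping are made only for this sketch and are not a specification of the real converter.

def command_to_interface(level: int):
    """Convert a signed 8-bit command level into (request, current_mA).

    level > 0 : traction request with magnitude |level| (0..255)
    level < 0 : brake request with magnitude |level| (0..255)
    level == 0: coasting (neither traction nor brake)
    The 8-bit magnitude is assumed to map linearly onto 4-20 mA.
    """
    level = max(-255, min(255, int(level)))       # clamp to the 8-bit range
    magnitude = abs(level)
    current_ma = 4.0 + 16.0 * magnitude / 255.0   # assumed linear D/A mapping
    if level > 0:
        request = "traction"
    elif level < 0:
        request = "brake"
    else:
        request = "coast"
    return request, current_ma

# Example: a half-level brake command
print(command_to_interface(-128))   # ('brake', about 12.0 mA)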
The invention provides a deep reinforcement learning algorithm that improves the robustness of the control model, reduces the real-train debugging time, and improves the ATO stopping precision. On this basis, an offline imitation learning method using the expert data set X is also provided: good historical ATO arrival data is imitated through behavior cloning to initialize the behavior policy network, and the policy network is then optimized online through reinforcement learning. Specifically, the method comprises the following steps:
as shown in fig. 1, in one aspect, the present application provides a train precise parking control method based on deep reinforcement learning, including the following steps:
s1, collecting ATO stop data, and preprocessing to obtain an expert data set X;
based on a large amount of good ATO inbound historical data, a behavior cloning method is adopted to initialize a policy network, and a deep deterministic policy gradient algorithm model is further adopted to optimize the behavior policy network. As an optional embodiment of the present application, optionally, the ATO station-stop data is collected and preprocessed to obtain an expert data set X, including:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
At present, trains on existing lines are controlled by a PI control algorithm, and the vehicle-mounted controller records a large number of logs. From these logs a database of good ATO stops can be built as follows:
First, an automatic script segments the on-board logs station by station, so that each record contains the complete process from departure to arrival and stopping.
Second, using the script and the approach-disc information in the logs, the distance travelled by the train between detecting the approach disc and coming to a stop is accumulated. The stopping precision of the train is determined from this distance; if it lies within a certain range (for example ±20 cm), the stop is considered good and is included in the good-ATO-stop database. (The detection range of the approach disc is ±0.5 m.)
In this embodiment, the ATO station-stop data of all existing lines is first collected, cleaned and segmented, and the trajectories with good stops are selected to form the expert data set. In the concrete implementation, the data of the last brake phase of each ATO stop is extracted and the trajectories of good stops are selected to form the expert data set X. A pair (s, a) in the expert data set X means that the conventional, well-tuned PI algorithm took action a in state s. The method of data cleaning and segmentation is not limited in this embodiment and is left to the user; a minimal sketch of this preprocessing is given below.
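The sketch assumes each log has already been segmented into per-station records of the final brake phase; the field names "stop_error_cm" (signed stop error from the approach disc) and "samples" (list of (state, action) pairs) are illustrative placeholders, not the actual on-board log format.

def build_expert_dataset(stop_records, tolerance_cm=20.0):
    """Select well-stopped trajectories and flatten them into (state, action) pairs.

    stop_records: iterable of dicts, one per ATO station stop, e.g.
        {"stop_error_cm": -7.5, "samples": [(state, action), ...]}
    where each (state, action) pair comes from the final brake phase.
    """
    expert_x = []
    for record in stop_records:
        if abs(record["stop_error_cm"]) <= tolerance_cm:   # keep only "good" stops
            expert_x.extend(record["samples"])              # pairs (s, a)
    return expert_x

# Toy usage with two fabricated stops: only the first is within the tolerance.
logs = [
    {"stop_error_cm": 12.0, "samples": [((1.2, -0.4, 3.0), -60), ((0.8, -0.5, 1.5), -80)]},
    {"stop_error_cm": 35.0, "samples": [((1.1, -0.3, 3.2), -40)]},
]
print(len(build_expert_dataset(logs)))   # 2 pairs kept, from the good stop only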
S2, performing behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train;
Behavior-cloning imitation learning is performed on the expert data set X to initialize the policy network used for reinforcement learning.
Fig. 3 shows the reinforcement learning framework for controlling the train, in which the meaning of each framework unit is as follows:
Agent: the VOBC (vehicle on-board controller) that executes the policy network;
Action: the traction command, the brake command and the command level are used as the action information;
Environment: the train, the track, and the other subsystems that affect train operation;
State: the speed, acceleration, target distance, gradient, approach-disc signal, etc. of the train are input as the state information;
Reward: the reward is directly related to the stopping precision. If the train stopping precision D is within the index range (for example ±20 cm), the reward obtained by the whole trajectory is 100 - |D|; if the stopping precision D is not within the index range, the reward obtained by the whole trajectory is -(100 + |D|) (a small reward-computation sketch is given after this list);
Trajectory: a trajectory is a continuous sequence of states and actions. Before the point 5 m from the station stopping point the conventional PI algorithm is used, and at 5 m the deep reinforcement learning algorithm is started. The trajectory begins when the reinforcement learning algorithm starts executing and ends when the train has stopped and the ATO has given up control.
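The reward rule above can be written as a small helper; the ±20 cm tolerance is the example value given in this disclosure, and the function name is purely illustrative.

def trajectory_rewards(stop_error_cm, num_steps, tolerance_cm=20.0):
    """Instant rewards r_1..r_T for one stop trajectory of num_steps steps.

    Every step before the train stops earns 0; only the final step is rewarded,
    with 100 - |D| when the stop error D is inside the tolerance and
    -(100 + |D|) when it is outside.
    """
    d = abs(stop_error_cm)
    final = 100.0 - d if d <= tolerance_cm else -(100.0 + d)
    rewards = [0.0] * (num_steps - 1) + [final]
    # The cumulative (return-to-go) reward of every step equals the final reward,
    # because all intermediate instant rewards are zero.
    returns = [final] * num_steps
    return rewards, returns

print(trajectory_rewards(8.0, num_steps=5))    # final reward 92.0
print(trajectory_rewards(30.0, num_steps=5))   # final reward -130.0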
Mapping each of these parts onto the ATO deep-reinforcement-learning stopping control algorithm gives the policy network of the reinforcement learning algorithm shown in Fig. 4.
As shown in Fig. 4, this network uses the state information of the reinforcement learning framework: the speed, acceleration, gradient, target distance, approach-disc signal, positioning error, etc. of the train are taken as the state input. The state information forms the input layer of a fully connected neural network; the tanh activation function is chosen for the output layer, which fixes the output range to [-1, 1]; the output is then scaled and mapped onto the action-space range [-255, 255] (the precision of the D/A converter). An output of 0 represents the coasting command, a positive output represents a traction level of the corresponding digital magnitude, and a negative output represents a brake level of the corresponding digital magnitude.
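A minimal PyTorch sketch of such a policy network follows: a fully connected network over the state features with a tanh output scaled to [-255, 255]. The hidden-layer widths and the ordering of the state features are assumptions made for this sketch; the disclosure does not fix them.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(s; theta): state -> command level in [-255, 255]."""

    def __init__(self, state_dim=6, hidden=64, action_scale=255.0):
        super().__init__()
        # state_dim covers speed, acceleration, gradient, target distance,
        # approach-disc signal and positioning error (order assumed here).
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),      # output in [-1, 1]
        )
        self.action_scale = action_scale          # map to [-255, 255]

    def forward(self, state):
        return self.action_scale * self.body(state)

# Toy usage: one state vector -> one command level
policy = PolicyNetwork()
s = torch.tensor([[5.0, -0.6, 0.2, 4.8, 1.0, 0.05]])   # illustrative values only
print(policy(s))                                         # tensor of shape (1, 1)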
The policy network of the train is initialized with the behavior-cloning imitation learning method. The state s and expert action a pairs in the expert data set X are extracted as supervised training data, and the policy network can then be trained with a classical supervised learning method. The training process only uses the expert data and iterates repeatedly to minimize the loss function; no interaction with the environment is needed. Specifically,
as an optional embodiment of the present application, optionally, performing a mimic learning of behavior cloning based on the expert data set X, and initializing a policy network of a train includes:
defining a loss function, and calculating errors of the strategy network output and the actual expert actions in the expert data set X by using the loss function; wherein the loss function is:
Figure BDA0003834931350000101
the loss function L (s, a; theta) is a single sample error and is used for describing an output value generated by the strategy network with the parameter theta in the input state s and an error which is the output a in the state s in the expert database;
(symbol)
Figure BDA0003834931350000102
the expression "defined as" means that the left expression is defined by the right expression;
[π(s;θ)-a] 2 the square of the error between the output of the strategy network pi (s; theta) and the actual expert action a is multiplied by 1/2 to eliminate the constant term of 2 when the gradient descent is derived;
defining a cost function, calculating the sum of the loss functions of all sample points of a whole track in the expert database X by using the cost function, and accumulating the loss of all tracks to obtain a total cost function J (theta); wherein the cost function is:
Figure BDA0003834931350000103
updating the neural network parameter theta by using the following gradient descent formula until the algorithm converges:
Figure BDA0003834931350000104
wherein:
θ now : current neural network parameters;
θ new : updated neural network parameters;
beta: the learning rate is a hyper-parameter and can be set;
Figure BDA0003834931350000105
cost function J (theta) at theta now A gradient value of (d);
and initializing the strategy network by utilizing the neural network parameter theta.
In this method, behavior cloning optimizes the policy network π(s; θ) with a as the label, i.e. the policy network is trained with a regression method from supervised learning. The parameters θ obtained with the behavior cloning method are used to initialize the policy network, which then already has the ability to control the train stop.
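A behavior-cloning training sketch of the loss and gradient-descent update described above is given below. The random placeholder tensors stand in for the (s, a) pairs of the expert data set X, and the plain SGD optimizer, layer sizes, learning rate and epoch count are illustrative assumptions (the mean of the squared errors is used instead of the raw sum purely for numerical convenience).

import torch
import torch.nn as nn

# Placeholders for the expert data set X: states (N, 6) and expert actions (N, 1)
states = torch.randn(256, 6)
actions = torch.empty(256, 1).uniform_(-255, 255)

policy = nn.Sequential(                           # pi(s; theta), tanh scaled below
    nn.Linear(6, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Tanh(),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)   # beta, settable

for epoch in range(100):                          # iterate until the loss converges
    optimizer.zero_grad()
    predicted = 255.0 * policy(states)            # scale tanh output to [-255, 255]
    # J(theta): 0.5 * (pi(s; theta) - a)^2 accumulated over the expert samples
    loss = 0.5 * ((predicted - actions) ** 2).mean()
    loss.backward()                               # gradient of J at theta_now
    optimizer.step()                              # theta_new = theta_now - beta * grad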
In this embodiment, in order to obtain better stopping precision, a deep deterministic policy gradient method is further used to train the policy network online.
S3, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
As an optional embodiment of the present application, the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture. As shown in Fig. 5, it is an Actor-Critic method in which the action a output by the policy network initialized with the parameters θ obtained in step S2 is combined with the state s to further train the value network.
As an optional embodiment of the present application, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network includes:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of the train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a; as shown in Fig. 5, the policy network (actor) is π(s; θ), already initialized by behavior cloning;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs the value q(s, a; w); as shown in Fig. 5, the value network (critic) q(s, a; w) outputs a scalar that evaluates how good the action is, its inputs being the state s and the action a output by the policy network;
and the deep optimization network is optimized and updated so that its policy network and value network keep learning as the train characteristics change.
The optimization and updating of the deep optimization network means updating the policy network and the value network inside it so that their respective neural network parameters keep adjusting as the train characteristics change, achieving lifelong learning.
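A compact PyTorch sketch of this actor-critic pair follows: the policy network π(s; θ) maps the state to an action, and the value network q(s, a; w) scores the (state, action) pair. The class names and layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi(s; theta), initialized by behavior cloning in practice."""
    def __init__(self, state_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )

    def forward(self, state):
        return 255.0 * self.net(state)            # action in [-255, 255]

class Critic(nn.Module):
    """Value network q(s, a; w): scores how good action a is in state s."""
    def __init__(self, state_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # scalar value q(s, a; w)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Wiring: state -> actor -> action; (state, action) -> critic -> value
actor, critic = Actor(), Critic()
s = torch.randn(1, 6)
a = actor(s)
print(critic(s, a))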
In this embodiment, as an optional implementation of the present application, the value network is updated per round using Monte Carlo returns, and the policy network is updated step by step in a temporal-difference manner.
Specifically, the value network is updated with Monte Carlo round updates: it is updated once per round, i.e. once per complete stop trajectory.
As an optional implementation of the present application, the method for updating the value network of the deep optimization network includes:
in the final stage of the station-stop braking (for example the last 10 m), controlling the train stop with a policy network π(s; θ) whose parameters are held fixed; the complete trajectory of the train motion controlled by the policy network has T steps, and the instant reward of every step before the train stops is 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range (for example ±20 cm), the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range (for example ±20 cm), the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range (for example ±20 cm), the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range (for example ±20 cm), the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
The value network parameters w_new computed in this way are iterated round by round; the value network is updated and keeps adjusting as the train characteristics change, achieving lifelong learning.
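A sketch of this Monte Carlo round update follows, under the assumption (consistent with the cost function J(w) and the returns u_t above) that the value network is regressed onto the cumulative reward of the finished trajectory; the network shape, learning rate and the use of the mean instead of the raw sum are illustrative choices.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))  # q(s, a; w)
optimizer = torch.optim.SGD(critic.parameters(), lr=1e-3)              # alpha, settable

def monte_carlo_round_update(states, actions, stop_error_cm, tolerance_cm=20.0):
    """One per-round (per-trajectory) update of the value network parameters w.

    states:  (T, 6) tensor of states visited in the finished stop trajectory
    actions: (T, 1) tensor of actions taken by the fixed policy network
    """
    d = abs(stop_error_cm)
    u = 100.0 - d if d <= tolerance_cm else -(100.0 + d)   # return-to-go of every step
    targets = torch.full((states.shape[0], 1), float(u))

    optimizer.zero_grad()
    q = critic(torch.cat([states, actions], dim=-1))
    # J(w) = 1/2 * sum_t (q(s_t, a_t; w) - u_t)^2, here averaged over the T steps
    loss = 0.5 * ((q - targets) ** 2).mean()
    loss.backward()
    optimizer.step()                                        # w_new = w_now - alpha * grad
    return loss.item()

# Toy round: 5 steps, stop error of 8 cm
print(monte_carlo_round_update(torch.randn(5, 6), torch.randn(5, 1), 8.0))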
The policy network, in turn, is updated step by step with a deterministic policy gradient algorithm; the goal of the policy network update is to increase the value output by the value network.
As an optional implementation of the present application, the method for updating the policy network of the deep optimization network includes:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
The policy network parameters θ_new computed in this way are iterated in a single-step updating mode; the policy network is updated and keeps adjusting as the train characteristics change, achieving lifelong learning.
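A sketch of this deterministic-policy-gradient step follows: the value network parameters w are held fixed and the policy parameters θ are moved up the gradient of q(s, π(s; θ); w), implemented here by backpropagating through the frozen critic. Layer sizes and the learning rate β are illustrative assumptions.

import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)   # beta, settable

def policy_gradient_step(state):
    """One single-step update of theta with the value network parameters w fixed.

    Gradient ascent on q(s, pi(s; theta); w) is implemented as gradient descent
    on its negative; backpropagation through the frozen critic supplies grad_a q,
    chained with grad_theta pi exactly as in g(theta).
    """
    for p in critic.parameters():       # fix the value network parameters w
        p.requires_grad_(False)
    actor_opt.zero_grad()
    action = 255.0 * actor(state)                        # a = pi(s; theta)
    objective = critic(torch.cat([state, action], dim=-1)).mean()
    (-objective).backward()                              # ascent on q == descent on -q
    actor_opt.step()                                     # theta_new = theta_now + beta * g
    for p in critic.parameters():
        p.requires_grad_(True)
    return objective.item()

print(policy_gradient_step(torch.randn(1, 6)))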
S4, outputting and storing the deep optimization network, and using it for precise train stopping control.
The policy network and value network of the deep optimization network are kept updated and are used for controlling the stopping precision of the train in real time. In the low-speed air-braking phase, the deep reinforcement learning algorithm replaces the original PI control algorithm; this reduces the real-train online debugging time of the ATO software and improves the stopping precision, and thereby solves the technical problem that, because the air brake responds slowly and with large randomness, the PI controller cannot control it well and the stopping precision deviates significantly.
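A sketch of this run-time switch follows: the conventional PI controller stays in charge until the train enters the final approach window (5 m in the trajectory definition above), after which the stored policy network produces the brake command. The PI controller shown and the callable interfaces are stand-in placeholders for this sketch.

def select_stop_command(state, distance_to_stop_m, pi_controller, policy_net,
                        switch_distance_m=5.0):
    """Choose the command source during the final station-stop brake phase.

    pi_controller: callable(state) -> command level, the legacy PI law (assumed)
    policy_net:    callable(state) -> command level, the stored deep network
    Beyond switch_distance_m the legacy PI controller keeps control; inside it,
    the deep reinforcement learning policy produces the brake command.
    """
    if distance_to_stop_m > switch_distance_m:
        return pi_controller(state)
    return policy_net(state)

# Toy usage with stand-in controllers
legacy_pi = lambda s: -40          # constant brake level, placeholder only
drl_policy = lambda s: -55         # placeholder for the trained policy network
print(select_stop_command(state=None, distance_to_stop_m=8.0,
                          pi_controller=legacy_pi, policy_net=drl_policy))   # -40
print(select_stop_command(state=None, distance_to_stop_m=3.2,
                          pi_controller=legacy_pi, policy_net=drl_policy))   # -55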
Therefore, the method and device make full use of the ATO historical data of existing lines and perform imitation learning offline; they remove the coupling between the PI control parameters and the vehicle characteristics by relying on the generalization capability of the neural network; they reduce the real-train online debugging time of the ATO software and improve the stopping precision; and the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.
It should be noted that, although the architecture design and network updating method of reinforcement learning have been described above with Actor-Critic as an example, those skilled in the art will understand that the disclosure is not limited thereto. The user can flexibly choose the optimization algorithm of the policy network according to the actual application scenario, as long as the technical function of the present application can be realized with the technical method described above.
Example 2
Based on the implementation principle of Embodiment 1, another aspect of the present application provides a device for implementing the above train precise stopping control method based on deep reinforcement learning, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
The working principle of each module and the information exchange between them have been described in detail in Embodiment 1 and are not repeated in this embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware instructed by a computer program; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the control methods described above. The modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of multiple computing devices, and they may alternatively be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, or they may be separately fabricated as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus the present invention is not limited to any specific combination of hardware and software. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), etc.; the storage medium may also comprise a combination of the above kinds of memory.
Example 3
Still further, in another aspect of the present application, a control system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for controlling train accurate parking based on deep reinforcement learning.
An embodiment of the present disclosure provides a control system comprising a processor and a memory for storing processor-executable instructions, wherein the processor is configured to execute the executable instructions to implement any one of the above train precise stopping control methods based on deep reinforcement learning.
Here, it should be noted that the number of processors may be one or more. Meanwhile, in the control system of the embodiment of the present disclosure, an input device and an output device may be further included. The processor, the memory, the input device, and the output device may be connected by a bus, or may be connected by other means, and are not limited specifically herein.
The memory, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as: the disclosed embodiment relates to a program or a module corresponding to a train accurate parking control method based on deep reinforcement learning. The processor executes various functional applications of the control system and data processing by executing software programs or modules stored in the memory.
The input device may be used to receive an input number or signal. Wherein the signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device may include a display device such as a display screen.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A train precise stopping control method based on deep reinforcement learning, characterized by comprising the following steps:
collecting ATO station-stop data and preprocessing it to obtain an expert data set X;
performing behavior-cloning imitation learning based on the expert data set X and initializing a policy network of the train;
training the policy network online with a deep deterministic policy gradient method to obtain a deep optimization network;
and outputting and storing the deep optimization network, and using the deep optimization network for precise train stopping control.
2. The train precise stopping control method based on deep reinforcement learning according to claim 1, wherein collecting the ATO station-stop data and preprocessing it to obtain the expert data set X comprises:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
3. The train precise stopping control method based on deep reinforcement learning according to claim 1, wherein performing the behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train comprises:
defining a loss function and using it to compute the error between the policy network output and the actual expert action in the expert data set X, the loss function being:
L(s, a; θ) ≜ (1/2)·[π(s; θ) - a]²
wherein the loss function L(s, a; θ) is a single-sample error that describes the difference between the output produced by the policy network with parameters θ for the input state s and the expert action a recorded for state s in the expert database; the symbol ≜ means "defined as", i.e. the left-hand expression is defined by the right-hand expression; [π(s; θ) - a]² is the squared error between the policy network output π(s; θ) and the actual expert action a, and the factor 1/2 cancels the constant 2 that appears when the gradient is taken for gradient descent;
defining a cost function that sums the loss over all sample points of each complete trajectory in the expert database X and accumulates the losses of all trajectories to obtain the total cost function J(θ):
J(θ) = Σ_i Σ_t L(s_t^(i), a_t^(i); θ)
updating the neural network parameters θ with the following gradient-descent rule until the algorithm converges:
θ_new = θ_now - β·∇_θ J(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
∇_θ J(θ)|θ=θ_now: the gradient of the cost function J(θ) evaluated at θ_now;
and initializing the policy network with the neural network parameters θ obtained in this way.
4. The train precise stopping control method based on deep reinforcement learning according to claim 3, wherein training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network comprises:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of a train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs a value q(s, a; w);
and the deep optimization network is optimized and updated so that its policy network and value network learn as the train characteristics change.
5. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture.
6. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the value network is updated with Monte Carlo round updates, and the policy network is updated with single-step temporal-difference style updates.
7. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the method for updating the value network of the deep optimization network comprises:
in the final stage of the station-stop braking, controlling the train stop with a policy network π(s; θ) whose parameters are held fixed, the complete trajectory of the train motion controlled by the policy network having T steps and the instant reward of every step before the train stops being 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range, the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range, the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
8. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the method for updating the policy network of the deep optimization network comprises:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
9. A device for implementing the train precise stopping control method based on deep reinforcement learning of any one of claims 1 to 8, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
10. A control system, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for train precise parking control based on deep reinforcement learning of any one of claims 1 to 8.
CN202211084513.6A 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning Pending CN115496201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211084513.6A CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211084513.6A CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115496201A true CN115496201A (en) 2022-12-20

Family

ID=84468012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211084513.6A Pending CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115496201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115837899A (en) * 2023-02-16 2023-03-24 华东交通大学 Multi-model adaptive fault compensation control method and system for motor train unit braking system
CN116824207A (en) * 2023-04-27 2023-09-29 国科赛赋河北医药技术有限公司 Multidimensional pathological image classification and early warning method based on reinforcement learning mode
CN116824207B (en) * 2023-04-27 2024-04-12 国科赛赋河北医药技术有限公司 Multidimensional pathological image classification and early warning method based on reinforcement learning mode


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination