CN115496201A - Train accurate parking control method based on deep reinforcement learning - Google Patents

Train accurate parking control method based on deep reinforcement learning

Info

Publication number
CN115496201A
CN115496201A (application number CN202211084513.6A)
Authority
CN
China
Prior art keywords
network
train
strategy
stop
deep
Prior art date
Legal status
Pending
Application number
CN202211084513.6A
Other languages
Chinese (zh)
Inventor
张磊 (Zhang Lei)
张建国 (Zhang Jianguo)
Current Assignee
Thales Sec Transportation System Ltd
Original Assignee
Thales Sec Transportation System Ltd
Priority date
Filing date
Publication date
Application filed by Thales Sec Transportation System Ltd
Priority to CN202211084513.6A
Publication of CN115496201A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B61 RAILWAYS
    • B61L GUIDING RAILWAY TRAFFIC; ENSURING THE SAFETY OF RAILWAY TRAFFIC
    • B61L27/00 Central railway traffic control systems; Trackside control; Communication systems specially adapted therefor
    • B61L27/60 Testing or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry


Abstract

The application relates to a train precise stopping control method based on deep reinforcement learning. The method comprises: collecting ATO station-stop data and preprocessing it to obtain an expert data set X; performing behavior-cloning imitation learning based on the expert data set X to initialize the policy network of the train; training the policy network online with a deep deterministic policy gradient method to obtain a deep optimization network; and outputting and storing the deep optimization network and using it for precise train stopping control. The following technical effects can be achieved: the ATO historical data of existing lines is fully exploited and imitation learning is performed offline; the coupling between the PI control parameters and the vehicle characteristics is removed by relying on the generalization capability of the neural network; the real-train online debugging time of the ATO software is reduced and the stopping precision is improved; and the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.

Description

Train accurate parking control method based on deep reinforcement learning
Technical Field
The disclosure relates to the technical field of rail transit, and in particular to a train precise stopping control method, device and system based on deep reinforcement learning.
Background
Metro systems play an important role in urban public transport; their degree of automation keeps rising, and the share of fully automatic driverless lines is steadily increasing. On a driverless line no driver is on board, so the requirements on the stopping precision achieved by ATO (Automatic Train Operation) vehicle control are higher. To make trains stop precisely, optimization of the train control algorithm has long been a research focus of ATO software.
A train control algorithm is generally tested on a train simulation model first, and only after the expected effect is reached is it tested on a real train. A train is a nonlinear dynamic system built from complex technical equipment, running in a complex environment and exhibiting complex spatio-temporal characteristics. In the final stopping phase the train braking switches from electric braking to air braking, and the air brake is a system with large hysteresis, nonlinearity and a degree of randomness.
With classical control methods, a target curve is typically designed and a tracking controller is designed to achieve automatic driving. The tracking controller usually employs a classical PI control algorithm to track the target curve. The PI control algorithm performs well during traction or electric braking; in the low-speed air-braking phase, however, because of the large delay, nonlinearity and randomness of the air brake, the actual train speed cannot follow the target curve as expected and the spread of the stopping precision becomes large.
The conventional approach to train simulation is to simplify the train into a linear dynamic system described by dynamic equations, which is sufficient for general control-algorithm research. Because the simulation model is heavily simplified, it can hardly reproduce the real behaviour of the brake system during the final stop, so the stopping control algorithm cannot be validated effectively on the simulation model. After the stopping control algorithm passes the simulation test, a large amount of real-train debugging is still needed to refine the algorithm and tune its parameters. Real-train debugging requires cooperation across several disciplines, the workload is large, and this is unfriendly to algorithm improvement. In addition, the effect of existing stopping control algorithms depends on the vehicle characteristics and line conditions, so parameters tuned for an existing line generally have to be re-tuned when the algorithm is transferred to a new line.
The control process of an ordinary train comprises four phases: starting acceleration, cruising, coasting, and braking to a stop. The stopping precision depends only on the braking phase; how the train is controlled before braking has essentially no influence on it. During high-speed braking the electric brake acts and the air brake does not; during low-speed braking the electric brake no longer acts and the air brake takes over. The electric brake responds quickly and consistently and can follow the commands of the PI controller well; the air brake responds slowly and with considerable randomness, so the PI controller cannot control it well and the stopping precision deviates significantly.
Disclosure of Invention
To solve the above problems, the present application provides a train precise stopping control method, device and system based on deep reinforcement learning.
In one aspect, the present application provides a train precise stopping control method based on deep reinforcement learning, comprising the following steps:
collecting ATO station-stop data and preprocessing it to obtain an expert data set X;
performing behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train;
training the policy network online with a deep deterministic policy gradient (DDPG) method to obtain a deep optimization network;
and outputting and storing the deep optimization network, and using the deep optimization network for precise train stopping control.
As an optional embodiment of the present application, collecting the ATO station-stop data and preprocessing it to obtain the expert data set X includes:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
As an optional embodiment of the present application, performing the behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train includes:
defining a loss function and using it to compute the error between the policy network output and the actual expert action in the expert data set X, the loss function being:
L(s, a; θ) ≜ (1/2)·[π(s; θ) - a]²
where the loss function L(s, a; θ) is a single-sample error that describes the difference between the output produced by the policy network with parameters θ for the input state s and the expert action a recorded for state s in the expert database; the symbol ≜ means "defined as", i.e. the left-hand expression is defined by the right-hand expression; [π(s; θ) - a]² is the squared error between the policy network output π(s; θ) and the actual expert action a, and the factor 1/2 cancels the constant 2 that appears when the gradient is taken for gradient descent;
defining a cost function that sums the loss over all sample points of each complete trajectory in the expert database X and accumulates the losses of all trajectories to obtain the total cost function J(θ):
J(θ) = Σ_i Σ_t L(s_t^(i), a_t^(i); θ)
where i runs over the trajectories and t over the sample points of trajectory i;
updating the neural network parameters θ with the following gradient-descent rule until the algorithm converges:
θ_new = θ_now - β·∇_θ J(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
∇_θ J(θ)|θ=θ_now: the gradient of the cost function J(θ) evaluated at θ_now;
and initializing the policy network with the neural network parameters θ obtained in this way.
As an optional embodiment of the present application, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network includes:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of the train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs a value q(s, a; w);
and the deep optimization network is optimized and updated so that its policy network and value network keep learning as the train characteristics change.
As an optional embodiment of the present application, the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture.
As an optional embodiment of the present application, the value network is updated per round using Monte Carlo returns, and the policy network is updated step by step in a temporal-difference manner.
As an optional implementation of the present application, the method for updating the value network of the deep optimization network includes:
in the final stage of the station-stop braking, controlling the train stop with a policy network π(s; θ) whose parameters are held fixed; the complete trajectory of the train motion controlled by the policy network has T steps, and the instant reward of every step before the train stops is 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range, the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range, the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
As an optional implementation of the present application, the method for updating the policy network of the deep optimization network includes:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
In another aspect, the application provides a device for implementing the above train precise stopping control method based on deep reinforcement learning, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
In another aspect of the present application, a control system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for controlling train accurate parking based on deep reinforcement learning.
The technical effects of the invention are as follows:
According to the method, the expert data set X is obtained by collecting and preprocessing ATO station-stop data; behavior-cloning imitation learning is performed based on the expert data set X to initialize the policy network of the train; the policy network is trained online with the deep deterministic policy gradient method to obtain the deep optimization network; and the deep optimization network is output, stored, and used for precise train stopping control. The following technical effects can be achieved:
the ATO historical data of existing lines is fully exploited and imitation learning is performed offline; the coupling between the PI control parameters and the vehicle characteristics is removed by relying on the generalization capability of the neural network;
the real-train online debugging time of the ATO software is reduced and the stopping precision is improved; the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic flow chart of an implementation of the train precise parking control method based on deep reinforcement learning of the invention;
FIG. 2 is a control schematic of the on-board controller and the control interface of the train of the present invention;
FIG. 3 is a schematic diagram of the reinforcement learning framework of the present invention;
FIG. 4 is a schematic diagram of the architecture of the policy network of the train of the present invention;
FIG. 5 is a schematic diagram of the deep optimization architecture of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Example 1
First, the control principle of the train needs to be made clear. Fig. 2 is a schematic diagram of the control interface between the vehicle-mounted controller and the train. The vehicle-mounted controller outputs a traction command and a braking command to the train to request traction or braking; it outputs an 8-bit digital quantity to a D/A converter, which converts it into a 4-20 mA current signal and sends it to the train. The vehicle-mounted controller fuses the information from the speed sensor, the acceleration sensor, the positioning antenna and other sources, calculates the train position in real time, and obtains gradient and speed-limit information from the line database according to that position. The speed, acceleration and other state information calculated in real time from the sensors serve as the input of the ATO module and are used for real-time control of the train.
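As a minimal illustration of this interface, the sketch below maps a signed command level in [-255, 255] (the action range used later in this disclosure) to a traction/brake/coast request plus the current an 8-bit D/A converter would output; the function name and the assumption of a linear 0-255 to 4-20 mA mapping are made only for this sketch and are not a specification of the real converter.

def command_to_interface(level: int):
    """Convert a signed 8-bit command level into (request, current_mA).

    level > 0 : traction request with magnitude |level| (0..255)
    level < 0 : brake request with magnitude |level| (0..255)
    level == 0: coasting (neither traction nor brake)
    The 8-bit magnitude is assumed to map linearly onto 4-20 mA.
    """
    level = max(-255, min(255, int(level)))       # clamp to the 8-bit range
    magnitude = abs(level)
    current_ma = 4.0 + 16.0 * magnitude / 255.0   # assumed linear D/A mapping
    if level > 0:
        request = "traction"
    elif level < 0:
        request = "brake"
    else:
        request = "coast"
    return request, current_ma

# Example: a half-level brake command
print(command_to_interface(-128))   # ('brake', about 12.0 mA)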
The invention provides a deep reinforcement learning algorithm that improves the robustness of the control model, reduces the real-train debugging time, and improves the ATO stopping precision. On this basis, an offline imitation learning method using the expert data set X is also provided: good historical ATO arrival data is imitated through behavior cloning to initialize the behavior policy network, and the policy network is then optimized online through reinforcement learning. Specifically, the method comprises the following steps:
as shown in fig. 1, in one aspect, the present application provides a train precise parking control method based on deep reinforcement learning, including the following steps:
s1, collecting ATO stop data, and preprocessing to obtain an expert data set X;
based on a large amount of good ATO inbound historical data, a behavior cloning method is adopted to initialize a policy network, and a deep deterministic policy gradient algorithm model is further adopted to optimize the behavior policy network. As an optional embodiment of the present application, optionally, the ATO station-stop data is collected and preprocessed to obtain an expert data set X, including:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
At present, trains on existing lines are controlled by a PI control algorithm, and the vehicle-mounted controller records a large number of logs. From these logs a database of good ATO stops can be built as follows:
First, an automatic script segments the on-board logs station by station, so that each record contains the complete process from departure to arrival and stopping.
Second, using the script and the approach-disc information in the logs, the distance travelled by the train between detecting the approach disc and coming to a stop is accumulated. The stopping precision of the train is determined from this distance; if it lies within a certain range (for example ±20 cm), the stop is considered good and is included in the good-ATO-stop database. (The detection range of the approach disc is ±0.5 m.)
In this embodiment, the ATO station-stop data of all existing lines is first collected, cleaned and segmented, and the trajectories with good stops are selected to form the expert data set. In the concrete implementation, the data of the last brake phase of each ATO stop is extracted and the trajectories of good stops are selected to form the expert data set X. A pair (s, a) in the expert data set X means that the conventional, well-tuned PI algorithm took action a in state s. The method of data cleaning and segmentation is not limited in this embodiment and is left to the user; a minimal sketch of this preprocessing is given below.
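The sketch assumes each log has already been segmented into per-station records of the final brake phase; the field names "stop_error_cm" (signed stop error from the approach disc) and "samples" (list of (state, action) pairs) are illustrative placeholders, not the actual on-board log format.

def build_expert_dataset(stop_records, tolerance_cm=20.0):
    """Select well-stopped trajectories and flatten them into (state, action) pairs.

    stop_records: iterable of dicts, one per ATO station stop, e.g.
        {"stop_error_cm": -7.5, "samples": [(state, action), ...]}
    where each (state, action) pair comes from the final brake phase.
    """
    expert_x = []
    for record in stop_records:
        if abs(record["stop_error_cm"]) <= tolerance_cm:   # keep only "good" stops
            expert_x.extend(record["samples"])              # pairs (s, a)
    return expert_x

# Toy usage with two fabricated stops: only the first is within the tolerance.
logs = [
    {"stop_error_cm": 12.0, "samples": [((1.2, -0.4, 3.0), -60), ((0.8, -0.5, 1.5), -80)]},
    {"stop_error_cm": 35.0, "samples": [((1.1, -0.3, 3.2), -40)]},
]
print(len(build_expert_dataset(logs)))   # 2 pairs kept, from the good stop only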
S2, performing behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train;
Behavior-cloning imitation learning is performed on the expert data set X to initialize the policy network used for reinforcement learning.
Fig. 3 shows the reinforcement learning framework for controlling the train, in which the meaning of each framework unit is as follows:
Agent: the VOBC (vehicle on-board controller) that executes the policy network;
Action: the traction command, the brake command and the command level are used as the action information;
Environment: the train, the track, and the other subsystems that affect train operation;
State: the speed, acceleration, target distance, gradient, approach-disc signal, etc. of the train are input as the state information;
Reward: the reward is directly related to the stopping precision. If the train stopping precision D is within the index range (for example ±20 cm), the reward obtained by the whole trajectory is 100 - |D|; if the stopping precision D is not within the index range, the reward obtained by the whole trajectory is -(100 + |D|) (a small reward-computation sketch is given after this list);
Trajectory: a trajectory is a continuous sequence of states and actions. Before the point 5 m from the station stopping point the conventional PI algorithm is used, and at 5 m the deep reinforcement learning algorithm is started. The trajectory begins when the reinforcement learning algorithm starts executing and ends when the train has stopped and the ATO has given up control.
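The reward rule above can be written as a small helper; the ±20 cm tolerance is the example value given in this disclosure, and the function name is purely illustrative.

def trajectory_rewards(stop_error_cm, num_steps, tolerance_cm=20.0):
    """Instant rewards r_1..r_T for one stop trajectory of num_steps steps.

    Every step before the train stops earns 0; only the final step is rewarded,
    with 100 - |D| when the stop error D is inside the tolerance and
    -(100 + |D|) when it is outside.
    """
    d = abs(stop_error_cm)
    final = 100.0 - d if d <= tolerance_cm else -(100.0 + d)
    rewards = [0.0] * (num_steps - 1) + [final]
    # The cumulative (return-to-go) reward of every step equals the final reward,
    # because all intermediate instant rewards are zero.
    returns = [final] * num_steps
    return rewards, returns

print(trajectory_rewards(8.0, num_steps=5))    # final reward 92.0
print(trajectory_rewards(30.0, num_steps=5))   # final reward -130.0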
Mapping each of these parts onto the ATO deep-reinforcement-learning stopping control algorithm gives the policy network of the reinforcement learning algorithm shown in Fig. 4.
As shown in Fig. 4, this network uses the state information of the reinforcement learning framework: the speed, acceleration, gradient, target distance, approach-disc signal, positioning error, etc. of the train are taken as the state input. The state information forms the input layer of a fully connected neural network; the tanh activation function is chosen for the output layer, which fixes the output range to [-1, 1]; the output is then scaled and mapped onto the action-space range [-255, 255] (the precision of the D/A converter). An output of 0 represents the coasting command, a positive output represents a traction level of the corresponding digital magnitude, and a negative output represents a brake level of the corresponding digital magnitude.
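A minimal PyTorch sketch of such a policy network follows: a fully connected network over the state features with a tanh output scaled to [-255, 255]. The hidden-layer widths and the ordering of the state features are assumptions made for this sketch; the disclosure does not fix them.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(s; theta): state -> command level in [-255, 255]."""

    def __init__(self, state_dim=6, hidden=64, action_scale=255.0):
        super().__init__()
        # state_dim covers speed, acceleration, gradient, target distance,
        # approach-disc signal and positioning error (order assumed here).
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),      # output in [-1, 1]
        )
        self.action_scale = action_scale          # map to [-255, 255]

    def forward(self, state):
        return self.action_scale * self.body(state)

# Toy usage: one state vector -> one command level
policy = PolicyNetwork()
s = torch.tensor([[5.0, -0.6, 0.2, 4.8, 1.0, 0.05]])   # illustrative values only
print(policy(s))                                         # tensor of shape (1, 1)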
The policy network of the train is initialized with the behavior-cloning imitation learning method. The state s and expert action a pairs in the expert data set X are extracted as supervised training data, and the policy network can then be trained with a classical supervised learning method. The training process only uses the expert data and iterates repeatedly to minimize the loss function; no interaction with the environment is needed. Specifically,
as an optional embodiment of the present application, optionally, performing a mimic learning of behavior cloning based on the expert data set X, and initializing a policy network of a train includes:
defining a loss function, and calculating errors of the strategy network output and the actual expert actions in the expert data set X by using the loss function; wherein the loss function is:
Figure BDA0003834931350000101
the loss function L (s, a; theta) is a single sample error and is used for describing an output value generated by the strategy network with the parameter theta in the input state s and an error which is the output a in the state s in the expert database;
(symbol)
Figure BDA0003834931350000102
the expression "defined as" means that the left expression is defined by the right expression;
[π(s;θ)-a] 2 the square of the error between the output of the strategy network pi (s; theta) and the actual expert action a is multiplied by 1/2 to eliminate the constant term of 2 when the gradient descent is derived;
defining a cost function, calculating the sum of the loss functions of all sample points of a whole track in the expert database X by using the cost function, and accumulating the loss of all tracks to obtain a total cost function J (theta); wherein the cost function is:
Figure BDA0003834931350000103
updating the neural network parameter theta by using the following gradient descent formula until the algorithm converges:
Figure BDA0003834931350000104
wherein:
θ now : current neural network parameters;
θ new : updated neural network parameters;
beta: the learning rate is a hyper-parameter and can be set;
Figure BDA0003834931350000105
cost function J (theta) at theta now A gradient value of (d);
and initializing the strategy network by utilizing the neural network parameter theta.
In this method, behavior cloning optimizes the policy network π(s; θ) with a as the label, i.e. the policy network is trained with a regression method from supervised learning. The parameters θ obtained with the behavior cloning method are used to initialize the policy network, which then already has the ability to control the train stop.
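A behavior-cloning training sketch of the loss and gradient-descent update described above is given below. The random placeholder tensors stand in for the (s, a) pairs of the expert data set X, and the plain SGD optimizer, layer sizes, learning rate and epoch count are illustrative assumptions (the mean of the squared errors is used instead of the raw sum purely for numerical convenience).

import torch
import torch.nn as nn

# Placeholders for the expert data set X: states (N, 6) and expert actions (N, 1)
states = torch.randn(256, 6)
actions = torch.empty(256, 1).uniform_(-255, 255)

policy = nn.Sequential(                           # pi(s; theta), tanh scaled below
    nn.Linear(6, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Tanh(),
)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)   # beta, settable

for epoch in range(100):                          # iterate until the loss converges
    optimizer.zero_grad()
    predicted = 255.0 * policy(states)            # scale tanh output to [-255, 255]
    # J(theta): 0.5 * (pi(s; theta) - a)^2 accumulated over the expert samples
    loss = 0.5 * ((predicted - actions) ** 2).mean()
    loss.backward()                               # gradient of J at theta_now
    optimizer.step()                              # theta_new = theta_now - beta * grad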
In this embodiment, in order to obtain better stopping precision, a deep deterministic policy gradient method is further used to train the policy network online.
S3, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
As an optional embodiment of the present application, the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture. As shown in Fig. 5, it is an Actor-Critic method in which the action a output by the policy network initialized with the parameters θ obtained in step S2 is combined with the state s to further train the value network.
As an optional embodiment of the present application, training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network includes:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of the train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a; as shown in Fig. 5, the policy network (actor) is π(s; θ), already initialized by behavior cloning;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs the value q(s, a; w); as shown in Fig. 5, the value network (critic) q(s, a; w) outputs a scalar that evaluates how good the action is, its inputs being the state s and the action a output by the policy network;
and the deep optimization network is optimized and updated so that its policy network and value network keep learning as the train characteristics change.
The optimization and updating of the deep optimization network means updating the policy network and the value network inside it so that their respective neural network parameters keep adjusting as the train characteristics change, achieving lifelong learning.
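A compact PyTorch sketch of this actor-critic pair follows: the policy network π(s; θ) maps the state to an action, and the value network q(s, a; w) scores the (state, action) pair. The class names and layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi(s; theta), initialized by behavior cloning in practice."""
    def __init__(self, state_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )

    def forward(self, state):
        return 255.0 * self.net(state)            # action in [-255, 255]

class Critic(nn.Module):
    """Value network q(s, a; w): scores how good action a is in state s."""
    def __init__(self, state_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # scalar value q(s, a; w)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Wiring: state -> actor -> action; (state, action) -> critic -> value
actor, critic = Actor(), Critic()
s = torch.randn(1, 6)
a = actor(s)
print(critic(s, a))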
In this embodiment, as an optional implementation of the present application, the value network is updated per round using Monte Carlo returns, and the policy network is updated step by step in a temporal-difference manner.
Specifically, the value network is updated with Monte Carlo round updates: it is updated once per round, i.e. once per complete stop trajectory.
As an optional implementation of the present application, the method for updating the value network of the deep optimization network includes:
in the final stage of the station-stop braking (for example the last 10 m), controlling the train stop with a policy network π(s; θ) whose parameters are held fixed; the complete trajectory of the train motion controlled by the policy network has T steps, and the instant reward of every step before the train stops is 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range (for example ±20 cm), the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range (for example ±20 cm), the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range (for example ±20 cm), the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range (for example ±20 cm), the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
The value network parameters w_new computed in this way are iterated round by round; the value network is updated and keeps adjusting as the train characteristics change, achieving lifelong learning.
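A sketch of this Monte Carlo round update follows, under the assumption (consistent with the cost function J(w) and the returns u_t above) that the value network is regressed onto the cumulative reward of the finished trajectory; the network shape, learning rate and the use of the mean instead of the raw sum are illustrative choices.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))  # q(s, a; w)
optimizer = torch.optim.SGD(critic.parameters(), lr=1e-3)              # alpha, settable

def monte_carlo_round_update(states, actions, stop_error_cm, tolerance_cm=20.0):
    """One per-round (per-trajectory) update of the value network parameters w.

    states:  (T, 6) tensor of states visited in the finished stop trajectory
    actions: (T, 1) tensor of actions taken by the fixed policy network
    """
    d = abs(stop_error_cm)
    u = 100.0 - d if d <= tolerance_cm else -(100.0 + d)   # return-to-go of every step
    targets = torch.full((states.shape[0], 1), float(u))

    optimizer.zero_grad()
    q = critic(torch.cat([states, actions], dim=-1))
    # J(w) = 1/2 * sum_t (q(s_t, a_t; w) - u_t)^2, here averaged over the T steps
    loss = 0.5 * ((q - targets) ** 2).mean()
    loss.backward()
    optimizer.step()                                        # w_new = w_now - alpha * grad
    return loss.item()

# Toy round: 5 steps, stop error of 8 cm
print(monte_carlo_round_update(torch.randn(5, 6), torch.randn(5, 1), 8.0))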
The policy network, in turn, is updated step by step with a deterministic policy gradient algorithm; the goal of the policy network update is to increase the value output by the value network.
As an optional implementation of the present application, the method for updating the policy network of the deep optimization network includes:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
The policy network parameters θ_new computed in this way are iterated in a single-step updating mode; the policy network is updated and keeps adjusting as the train characteristics change, achieving lifelong learning.
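A sketch of this deterministic-policy-gradient step follows: the value network parameters w are held fixed and the policy parameters θ are moved up the gradient of q(s, π(s; θ); w), implemented here by backpropagating through the frozen critic. Layer sizes and the learning rate β are illustrative assumptions.

import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-4)   # beta, settable

def policy_gradient_step(state):
    """One single-step update of theta with the value network parameters w fixed.

    Gradient ascent on q(s, pi(s; theta); w) is implemented as gradient descent
    on its negative; backpropagation through the frozen critic supplies grad_a q,
    chained with grad_theta pi exactly as in g(theta).
    """
    for p in critic.parameters():       # fix the value network parameters w
        p.requires_grad_(False)
    actor_opt.zero_grad()
    action = 255.0 * actor(state)                        # a = pi(s; theta)
    objective = critic(torch.cat([state, action], dim=-1)).mean()
    (-objective).backward()                              # ascent on q == descent on -q
    actor_opt.step()                                     # theta_new = theta_now + beta * g
    for p in critic.parameters():
        p.requires_grad_(True)
    return objective.item()

print(policy_gradient_step(torch.randn(1, 6)))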
S4, outputting and storing the deep optimization network, and using it for precise train stopping control.
The policy network and value network of the deep optimization network are kept updated and are used for controlling the stopping precision of the train in real time. In the low-speed air-braking phase, the deep reinforcement learning algorithm replaces the original PI control algorithm; this reduces the real-train online debugging time of the ATO software and improves the stopping precision, and thereby solves the technical problem that, because the air brake responds slowly and with large randomness, the PI controller cannot control it well and the stopping precision deviates significantly.
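A sketch of this run-time switch follows: the conventional PI controller stays in charge until the train enters the final approach window (5 m in the trajectory definition above), after which the stored policy network produces the brake command. The PI controller shown and the callable interfaces are stand-in placeholders for this sketch.

def select_stop_command(state, distance_to_stop_m, pi_controller, policy_net,
                        switch_distance_m=5.0):
    """Choose the command source during the final station-stop brake phase.

    pi_controller: callable(state) -> command level, the legacy PI law (assumed)
    policy_net:    callable(state) -> command level, the stored deep network
    Beyond switch_distance_m the legacy PI controller keeps control; inside it,
    the deep reinforcement learning policy produces the brake command.
    """
    if distance_to_stop_m > switch_distance_m:
        return pi_controller(state)
    return policy_net(state)

# Toy usage with stand-in controllers
legacy_pi = lambda s: -40          # constant brake level, placeholder only
drl_policy = lambda s: -55         # placeholder for the trained policy network
print(select_stop_command(state=None, distance_to_stop_m=8.0,
                          pi_controller=legacy_pi, policy_net=drl_policy))   # -40
print(select_stop_command(state=None, distance_to_stop_m=3.2,
                          pi_controller=legacy_pi, policy_net=drl_policy))   # -55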
Therefore, the method and device make full use of the ATO historical data of existing lines and perform imitation learning offline; they remove the coupling between the PI control parameters and the vehicle characteristics by relying on the generalization capability of the neural network; they reduce the real-train online debugging time of the ATO software and improve the stopping precision; and the policy network and value network of reinforcement learning keep adjusting as the train characteristics change, achieving lifelong learning.
It should be noted that, although the architecture design and network updating method of reinforcement learning have been described above with Actor-Critic as an example, those skilled in the art will understand that the disclosure is not limited thereto. The user can flexibly choose the optimization algorithm of the policy network according to the actual application scenario, as long as the technical function of the present application can be realized with the technical method described above.
Example 2
Based on the implementation principle of Embodiment 1, another aspect of the present application provides a device for implementing the above train precise stopping control method based on deep reinforcement learning, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
The working principle of each module and the information exchange between them have been described in detail in Embodiment 1 and are not repeated in this embodiment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware instructed by a computer program; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the control methods described above. The modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network of multiple computing devices, and they may alternatively be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device, or they may be separately fabricated as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus the present invention is not limited to any specific combination of hardware and software. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), etc.; the storage medium may also comprise a combination of the above kinds of memory.
Example 3
Still further, in another aspect of the present application, a control system is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for controlling train accurate parking based on deep reinforcement learning.
An embodiment of the present disclosure provides a control system comprising a processor and a memory for storing processor-executable instructions, wherein the processor is configured to execute the executable instructions to implement any one of the above train precise stopping control methods based on deep reinforcement learning.
Here, it should be noted that the number of processors may be one or more. Meanwhile, in the control system of the embodiment of the present disclosure, an input device and an output device may be further included. The processor, the memory, the input device, and the output device may be connected by a bus, or may be connected by other means, and are not limited specifically herein.
The memory, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and various modules, such as: the disclosed embodiment relates to a program or a module corresponding to a train accurate parking control method based on deep reinforcement learning. The processor executes various functional applications of the control system and data processing by executing software programs or modules stored in the memory.
The input device may be used to receive an input number or signal. Wherein the signal may be a key signal generated in connection with user settings and function control of the device/terminal/server. The output device may include a display device such as a display screen.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A train precise stopping control method based on deep reinforcement learning, characterized by comprising the following steps:
collecting ATO station-stop data and preprocessing it to obtain an expert data set X;
performing behavior-cloning imitation learning based on the expert data set X and initializing a policy network of the train;
training the policy network online with a deep deterministic policy gradient method to obtain a deep optimization network;
and outputting and storing the deep optimization network, and using the deep optimization network for precise train stopping control.
2. The train precise stopping control method based on deep reinforcement learning according to claim 1, wherein collecting the ATO station-stop data and preprocessing it to obtain the expert data set X comprises:
collecting the ATO station-stop data of all existing lines;
cleaning and segmenting the ATO station-stop data according to a preset data preprocessing rule to obtain preprocessed ATO station-stop data;
and extracting, from the preprocessed data, the data of the last brake phase of each ATO station stop, and selecting the stops whose trajectories meet expectations to form the expert data set X.
3. The train precise stopping control method based on deep reinforcement learning according to claim 1, wherein performing the behavior-cloning imitation learning based on the expert data set X and initializing the policy network of the train comprises:
defining a loss function and using it to compute the error between the policy network output and the actual expert action in the expert data set X, the loss function being:
L(s, a; θ) ≜ (1/2)·[π(s; θ) - a]²
wherein the loss function L(s, a; θ) is a single-sample error that describes the difference between the output produced by the policy network with parameters θ for the input state s and the expert action a recorded for state s in the expert database; the symbol ≜ means "defined as", i.e. the left-hand expression is defined by the right-hand expression; [π(s; θ) - a]² is the squared error between the policy network output π(s; θ) and the actual expert action a, and the factor 1/2 cancels the constant 2 that appears when the gradient is taken for gradient descent;
defining a cost function that sums the loss over all sample points of each complete trajectory in the expert database X and accumulates the losses of all trajectories to obtain the total cost function J(θ):
J(θ) = Σ_i Σ_t L(s_t^(i), a_t^(i); θ)
updating the neural network parameters θ with the following gradient-descent rule until the algorithm converges:
θ_new = θ_now - β·∇_θ J(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
∇_θ J(θ)|θ=θ_now: the gradient of the cost function J(θ) evaluated at θ_now;
and initializing the policy network with the neural network parameters θ obtained in this way.
4. The train precise stopping control method based on deep reinforcement learning according to claim 3, wherein training the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network comprises:
presetting the deep deterministic policy gradient method;
building a deep optimization framework consisting of a policy network and a value network based on the deep deterministic policy gradient method to obtain the deep optimization network, wherein, in the deep optimization network:
the state s of a train reinforcement learning framework is taken as the input of the policy network, and the policy network outputs an action a;
the state s of the train reinforcement learning framework and the action a output by the policy network are taken as the input of the value network, and the value network outputs a value q(s, a; w);
and the deep optimization network is optimized and updated so that its policy network and value network learn as the train characteristics change.
5. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the deep deterministic policy gradient method is a deep deterministic policy gradient algorithm based on the Actor-Critic architecture.
6. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the value network is updated with Monte Carlo round updates, and the policy network is updated with single-step temporal-difference style updates.
7. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the method for updating the value network of the deep optimization network comprises:
in the final stage of the station-stop braking, controlling the train stop with a policy network π(s; θ) whose parameters are held fixed, the complete trajectory of the train motion controlled by the policy network having T steps and the instant reward of every step before the train stops being 0, i.e.: r_1 = r_2 = … = r_(T-1) = 0;
judging whether the train stopping precision D is within the index range:
if the stopping precision D is within the index range, the reward obtained at the last step of the trajectory is 100 - |D|, i.e.: r_T = 100 - |D|;
if the stopping precision D is not within the index range, the reward obtained at the last step of the trajectory is -(100 + |D|), i.e.: r_T = -(100 + |D|);
if the stopping precision D is within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = 100 - |D|;
if the stopping precision D is not within the index range, the cumulative reward from each step to the end of the trajectory is: u_1 = u_2 = … = u_T = -(100 + |D|);
calculating the cost function:
J(w) = (1/2)·Σ_(t=1..T) [q(s_t, a_t; w) - u_t]²
updating the parameters:
w_new = w_now - α·∇_w J(w)|w=w_now
wherein:
w_now: the current neural network parameters;
w_new: the updated neural network parameters;
α: the learning rate, a settable hyper-parameter;
∇_w J(w)|w=w_now: the gradient of the cost function J(w) evaluated at w_now.
8. The train precise stopping control method based on deep reinforcement learning according to claim 4, wherein the method for updating the policy network of the deep optimization network comprises:
fixing the value network parameters w, and controlling the train stop with the policy network;
computing the deterministic policy gradient by back propagation:
g(θ) = ∇_θ π(s; θ)·∇_a q(s, a; w)|a=π(s; θ)
updating the parameters by gradient ascent:
θ_new = θ_now + β·g(θ)|θ=θ_now
wherein:
θ_now: the current neural network parameters;
θ_new: the updated neural network parameters;
β: the learning rate, a settable hyper-parameter;
g(θ)|θ=θ_now: the value of the policy gradient g(θ) evaluated at θ_now.
9. A device for implementing the train precise stopping control method based on deep reinforcement learning of any one of claims 1 to 8, comprising:
an ATO data acquisition module, configured to collect ATO station-stop data and preprocess it to obtain the expert data set X;
an initialization module, configured to perform behavior-cloning imitation learning based on the expert data set X and initialize the policy network of the train;
a network optimization module, configured to train the policy network online with the deep deterministic policy gradient method to obtain the deep optimization network;
and a train control module, configured to output and store the deep optimization network and use it for precise train stopping control.
10. A control system, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method for train precise parking control based on deep reinforcement learning of any one of claims 1 to 8.
CN202211084513.6A 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning Pending CN115496201A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211084513.6A CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211084513.6A CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115496201A true CN115496201A (en) 2022-12-20

Family

ID=84468012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211084513.6A Pending CN115496201A (en) 2022-09-06 2022-09-06 Train accurate parking control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115496201A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115837899A (en) * 2023-02-16 2023-03-24 华东交通大学 Multi-model adaptive fault compensation control method and system for motor train unit braking system
CN116824207A (en) * 2023-04-27 2023-09-29 国科赛赋河北医药技术有限公司 Multidimensional pathological image classification and early warning method based on reinforcement learning mode
CN116824207B (en) * 2023-04-27 2024-04-12 国科赛赋河北医药技术有限公司 Multidimensional pathological image classification and early warning method based on reinforcement learning mode


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination