CN116562553A - Intelligent train dispatching optimization method and system and electronic equipment - Google Patents

Intelligent train dispatching optimization method and system and electronic equipment

Info

Publication number
CN116562553A
Authority
CN
China
Prior art keywords
train
training
reinforcement learning
current
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310467034.0A
Other languages
Chinese (zh)
Inventor
阴佳腾
吴卫
陈星
范礼乾
张金雷
杨立兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Rail Transit Group Co ltd
Beijing Jiaotong University
China Railway Siyuan Survey and Design Group Co Ltd
Original Assignee
Nanchang Rail Transit Group Co ltd
Beijing Jiaotong University
China Railway Siyuan Survey and Design Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Rail Transit Group Co ltd, Beijing Jiaotong University, China Railway Siyuan Survey and Design Group Co Ltd filed Critical Nanchang Rail Transit Group Co ltd
Priority to CN202310467034.0A priority Critical patent/CN116562553A/en
Publication of CN116562553A publication Critical patent/CN116562553A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B61 RAILWAYS
    • B61L GUIDING RAILWAY TRAFFIC; ENSURING THE SAFETY OF RAILWAY TRAFFIC
    • B61L27/00 Central railway traffic control systems; Trackside control; Communication systems specially adapted therefor
    • B61L27/10 Operations, e.g. scheduling or time tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/109 Time management, e.g. calendars, reminders, meetings or time accounting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Train Traffic Observation, Control, And Security (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Mechanical Engineering (AREA)
  • Primary Health Care (AREA)
  • Game Theory and Decision Science (AREA)

Abstract

The invention discloses an intelligent train dispatching optimization method and system and electronic equipment, and relates to the technical field of intelligent train dispatching. The method comprises: acquiring actual performance operation data of a train to be tested at the current moment; inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment, the multi-task deep reinforcement learning model being obtained by training a dueling double deep Q-network (DDDQN) model with historical operation data of trains in multiple scenes; and controlling the running of the train to be tested according to the scheduling strategy for the next moment. By constructing the multi-task deep reinforcement learning model, the invention can complete train dispatching in multiple scenes and improves the rationality of train dispatching.

Description

Intelligent train dispatching optimization method and system and electronic equipment
Technical Field
The invention relates to the technical field of intelligent train dispatching, in particular to an intelligent train dispatching optimization method, an intelligent train dispatching optimization system and electronic equipment.
Background
Various emergencies, such as bad weather, foreign-object intrusion, and failures of vehicles or trackside equipment, often occur in the daily operation of high-speed railways, disturbing normal train operation and causing train delays. In the existing high-speed railway dispatching and command system, dispatching adjustment strategies under emergencies, such as holding trains at stations or shifting the train operation diagram, still have to be carried out manually. Manual adjustment suffers from low efficiency, poor communication between dispatchers of different jurisdictions, and the inability to take a global view; moreover, the experience levels of dispatchers vary, and erroneous dispatching decisions can even cause train conflicts or wider-area delays. Therefore, it is necessary to study intelligent dispatching and command technology for high-speed railways intensively and to optimize the dispatching decision process of train groups from a global perspective, so as to reduce train delays under emergencies and improve passenger service quality.
In current research that applies machine learning to the train operation adjustment problem under emergencies, a dedicated model is trained for a particular scene and can only solve the operation adjustment problem in that scene. When a model trained in one scene is applied to other scenes, it performs poorly. In actual operation, emergencies are diverse and highly random, and training a separate train operation adjustment model for every scene is impractical.
Disclosure of Invention
The invention aims to provide an intelligent train dispatching optimization method, system and electronic equipment, which can finish train dispatching under multiple scenes and improve the rationality of train dispatching.
In order to achieve the above object, the present invention provides the following solutions:
an intelligent train dispatching optimization method comprises the following steps:
acquiring actual performance operation data of a train to be tested at the current moment;
inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment; the multi-task deep reinforcement learning model is obtained by training a dueling double deep Q-network (DDDQN) model with historical operation data of trains in multiple scenes;
and controlling the running of the train to be tested according to the scheduling strategy of the train to be tested at the next moment.
Optionally, before acquiring the actual performance operation data of the train at the current moment, the method further includes:
determining historical operation data of trains in a plurality of training scenes;
constructing a plurality of deep reinforcement learning models; the deep reinforcement learning models correspond one-to-one to the training scenes; each deep reinforcement learning model comprises a Q-EvaluateNet structural model and a Q-TargetNet structural model;
determining the plurality of deep reinforcement learning models as the deep reinforcement learning models of the 0-th iteration;
setting the first iteration number i = 1;
performing parallel training on the deep reinforcement learning models of the (i-1)-th iteration using the historical operation data of the trains in the plurality of training scenes, until the training rounds of the plurality of deep reinforcement learning models reach a training round threshold, to obtain the deep reinforcement learning models of the i-th iteration;
judging whether the first iteration number reaches a first iteration number threshold to obtain a first judgment result;
if the first judgment result is negative, calculating a return value of the i-th iteration;
judging whether the return value of the i-th iteration is greater than a return value threshold to obtain a second judgment result;
if the second judgment result is negative, increasing the first iteration number i by 1 and returning to the step of performing parallel training on the deep reinforcement learning models of the (i-1)-th iteration using the historical operation data of the trains in the plurality of training scenes until the training rounds of the plurality of deep reinforcement learning models reach the training round threshold, to obtain the deep reinforcement learning models of the i-th iteration;
if the second judgment result is positive, performing multi-task training on the deep reinforcement learning models of the (i-1)-th iteration using the historical operation data of the trains in the plurality of training scenes, to obtain the deep reinforcement learning models of the i-th iteration;
and if the first judgment result is positive, determining the deep reinforcement learning models of the i-th iteration as the multi-task deep reinforcement learning model.
Optionally, before determining the historical operation data of the train in the plurality of training scenarios, the method further includes:
determining any scene of the train as a current scene;
respectively determining the degree of difference between each scene of the train other than the current scene and the current scene;
traversing all scenes of the train to obtain a plurality of difference degrees;
after arranging the plurality of difference degrees in descending order, determining the scenes corresponding to the first preset number of difference degrees as a pending scene set;
and performing de-duplication processing on the scenes in the pending scene set to obtain a plurality of training scenes.
Optionally, performing parallel training on the deep reinforcement learning models of the (i-1)-th iteration using the historical operation data of the trains in the plurality of training scenes, until the training rounds of the plurality of deep reinforcement learning models reach a training round threshold, to obtain the deep reinforcement learning models of the i-th iteration, includes:
determining any training scene as a current training scene;
initializing parameters of the Q-EvaluateNet structural model corresponding to the current training scene;
initializing parameters of the Q-TargetNet structural model corresponding to the current training scene;
constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the trains;
inputting the plurality of train state vectors at the current historical moment into the Q-EvaluateNet structural model to obtain a plurality of first action vectors;
executing the corresponding action vectors based on the plurality of train state vectors at the current historical moment to obtain a plurality of train state vectors at the next historical moment;
inputting the plurality of train state vectors at the next historical moment into the Q-EvaluateNet structural model to obtain a plurality of second action vectors;
inputting the plurality of train state vectors at the next historical moment into the Q-TargetNet structural model to obtain a plurality of target Q values;
determining a loss function value according to the plurality of second action vectors and the plurality of target Q values;
adjusting the parameters of the Q-EvaluateNet structural model with a gradient descent method according to the loss function value, updating the current historical moment, and returning to the step of constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the trains, until the number of parameter adjustments of the Q-EvaluateNet structural model reaches a first parameter adjustment threshold;
copying the parameters of the Q-EvaluateNet structural model to the Q-TargetNet structural model, updating the current historical moment, and returning to the step of constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the trains, until the number of parameter adjustments of the Q-TargetNet structural model reaches a second parameter adjustment threshold;
and determining the trained Q-EvaluateNet structural model as the deep reinforcement learning model of the i-th iteration for the current scene.
Optionally, performing multi-task training on the deep reinforcement learning models of the (i-1)-th iteration using the historical operation data of the trains in the plurality of training scenes to obtain the deep reinforcement learning models of the i-th iteration includes:
defining a task set; the tasks in the task set correspond one-to-one to the training scenes;
determining an optimal training order of the plurality of tasks in the task set using a curriculum algorithm, and constructing a task sequence according to the optimal training order;
determining the first task in the task sequence as a current task;
setting the second iteration number m = 1;
determining the training scene corresponding to the current task as a current training scene;
training the deep reinforcement learning model corresponding to the current task using the historical operation data of the trains in the current scene;
updating the parameters of the plurality of deep reinforcement learning models according to the deep reinforcement learning model corresponding to the current task;
determining a composite loss function of the plurality of tasks;
determining the task with the smallest composite loss function as the current task, updating the parameters of the deep reinforcement learning model of the current task, updating the transfer information vectors from the current task to the plurality of non-current tasks using the Frank-Wolfe algorithm, increasing the second iteration number m by 1, and returning to the step of determining the training scene corresponding to the current task as the current training scene, until the second iteration number reaches a second iteration number threshold, to obtain the deep reinforcement learning models of the i-th iteration.
Optionally, after updating the transfer information vectors from the current task to the plurality of non-current tasks using the Frank-Wolfe algorithm, the method further includes:
updating the corresponding deep reinforcement learning models according to the transfer information vectors.
An intelligent train dispatch optimization system comprising:
the actual performance operation data acquisition module is used for acquiring the actual performance operation data of the train to be tested at the current moment;
the scheduling strategy prediction module is used for inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment; the multi-task deep reinforcement learning model is obtained by training a dueling double deep Q-network (DDDQN) model with historical operation data of trains in multiple scenes;
and the train control module is used for controlling the running of the train to be tested according to the scheduling strategy of the next moment of the train to be tested.
An electronic device comprising a memory for storing a computer program and a processor running the computer program to cause the electronic device to perform the method of intelligent dispatch optimization of a train.
Optionally, the memory is a readable storage medium.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
According to the intelligent train dispatching optimization method, system and electronic equipment provided by the invention, a multi-task deep reinforcement learning model is obtained by training a dueling double deep Q-network (DDDQN) model with the historical operation data of trains in multiple scenes, so that train dispatching in multiple scenes can be completed and the rationality of train dispatching is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for optimizing intelligent train dispatching in embodiment 1 of the present invention;
FIG. 2 is a diagram showing the structure of DDDQN model in embodiment 2 of the present invention;
FIG. 3 is a structural diagram of the neural network Q_E in embodiment 2 of the present invention;
fig. 4 is a flowchart of a multi-task deep reinforcement learning algorithm in embodiment 2 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide an intelligent train dispatching optimization method, system and electronic equipment, which can finish train dispatching under multiple scenes and improve the rationality of train dispatching.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, this embodiment provides a method for optimizing intelligent train dispatching, including:
step 101: and acquiring actual performance operation data of the train to be tested at the current moment.
Step 102: and inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment.
The multi-task deep reinforcement learning model is obtained by training the dual-pair neural network model by utilizing historical operation data of trains in multiple scenes.
Step 103: and controlling the running of the train to be tested according to the scheduling strategy of the next moment of the train to be tested.
Prior to step 101, further comprising:
step 104: historical operating data of the train in a plurality of training scenarios is determined.
Step 105: constructing a plurality of deep reinforcement learning models; the deep reinforcement learning models correspond one-to-one to the training scenes; each deep reinforcement learning model comprises a Q-EvaluateNet (Q estimation) structural model and a Q-TargetNet (Q target) structural model.
Step 106: the plurality of deep reinforcement learning models is determined to be the deep reinforcement learning model at the 0 th iteration.
Step 107: let the first iteration number i=1.
Step 108: and respectively carrying out parallel training on the depth reinforcement learning models in the ith-1 th iteration by utilizing the historical operation data of the trains in the plurality of training scenes until training rounds of the plurality of depth reinforcement learning models reach training round thresholds, so as to obtain the depth reinforcement learning model in the ith iteration.
Step 108, comprising,
step 1081: and determining any training scene as the current training scene.
Step 1082: and initializing parameters of the Q-EvaluateNet structural model corresponding to the current training scene.
Step 1083: and initializing parameters of the Q-TargetNet structural model corresponding to the current training scene.
Step 1084: and constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the train.
Step 1085: and inputting the plurality of train state vectors at the current historical moment into the Q-evaluation Net structural model to obtain a plurality of first action vectors.
Step 1086: and executing corresponding motion vectors based on the plurality of train state vectors at the current historical moment to obtain a plurality of train state vectors at the next historical moment.
Step 1087: and inputting a plurality of train state vectors at the next historical moment into the Q-evaluation Net structural model to obtain a plurality of second motion vectors.
Step 1088: and inputting a plurality of train state vectors at the next historical moment into the Q-Target Net structural model to obtain a plurality of Target Q values.
Step 1089: a loss function value is determined based on the plurality of second motion vectors and the plurality of target Q values.
Step 10810: and (3) according to the loss function value, adjusting parameters of the Q-evaluation Net structural model by using a gradient descent method, updating the current historical moment, and returning to the step 1084 until the parameter adjustment times of the Q-evaluation Net structural model reach a first parameter adjustment times threshold.
Step 10811: copying parameters of the Q-evaluation Net structural model to the Q-Target Net structural model, updating the current historical moment and returning to the step 1084 until the parameter adjustment times of the Q-Target Net structural model reach a second parameter adjustment times threshold.
Step 10812: and determining the trained Q-EvaluateNet structural model as a deep reinforcement learning model in the ith iteration of the current scene.
Step 109: judging whether the first iteration number reaches a first iteration number threshold value or not to obtain a first judgment result; if the first determination result is no, executing step 1010; if the first determination result is yes, step 1014 is performed.
Step 1010: and calculating the return value of the ith iteration.
Step 1011: and judging whether the return value of the ith iteration is larger than a return value threshold value or not, and obtaining a second judgment result. If the second determination result is no, then step 1012 is performed; if the second determination result is yes, step 1013 is executed.
Step 1012: the value of the first iteration number i is incremented by 1 and step 108 is returned.
Step 1013: and respectively carrying out multitasking training on the multiple i-1 th iterative deep reinforcement learning models by utilizing the historical operation data of the trains in the multiple training scenes to obtain the i-1 th iterative deep reinforcement learning model.
Step 1014: and determining the depth reinforcement learning model at the ith iteration as a multi-task depth reinforcement learning model.
Prior to step 104, further comprising:
step 1015: and determining any scene in the train as the current scene.
Step 1016: and respectively determining the difference degree of each scene outside the current scene in the train and the current scene.
Step 1017: traversing all scenes in the train to obtain a plurality of difference degrees.
Step 1018: and after the plurality of difference degrees are arranged in a descending order, determining the scene corresponding to the preset number of difference degrees as a set of undetermined scenes.
Step 1019: and performing de-duplication processing on the scenes in the set of undetermined scenes to obtain a plurality of training scenes.
Step 1013: comprising:
step 10131: defining a task set; the tasks in the task set are in one-to-one correspondence with the training scenes.
Step 10132: and determining the optimal training sequence of a plurality of tasks in the task set by using a course algorithm, and constructing a task sequence according to the optimal training sequence.
Step 10133: and determining the first task in the task sequence as the current task.
Step 10134: let the second iteration number m=1.
Step 10135: and determining the training scene corresponding to the current task as the current training scene.
Step 10136: and training a deep reinforcement learning model corresponding to the current task by utilizing the historical operation data of the train in the current scene.
Step 10137: and updating parameters of the multiple deep reinforcement learning models according to the deep reinforcement learning model corresponding to the current task.
Step 10138: a composite loss function for a plurality of tasks is determined.
Step 10139: determining the task with the smallest composite loss function as the current task, updating parameters of a current task deep reinforcement learning model, updating transfer information vectors of the current task to a plurality of non-current tasks by using a Frankwolfe algorithm, increasing the value of the second iteration number m by 1, and returning to the step 10135 until the second iteration number reaches a second iteration number threshold value to obtain the i-th deep reinforcement learning model.
Further, in step 10139, after updating the transfer information vector of the current task to the plurality of non-current tasks using the Frankwolfe algorithm, further includes: and updating the corresponding deep reinforcement learning model according to the transfer information vector.
Example 2
The embodiment provides a train intelligent scheduling optimization method, which comprises the following steps:
step S1, acquiring and storing historical operation data of the train. The data can be obtained directly in a simulation environment or can be obtained from an operation control subsystem in an actual high-speed train dispatching command system. The obtained actual performance operation data form a state vector of the train, and the state vector is input data and a basis for train operation adjustment and scheduling decision.
Train state vector definition:
The state of train k at time t, s_t^k, is represented by a 1 × 12 row vector, as shown in equation (1), i.e., s_t^k = [m_0, m_1, ..., m_11]. The 12 elements represent different pieces of observation information about the train and can be divided into five parts: the type of section in which the train is located (m_0 to m_3), the occupancy of the sections adjacent to the train (m_4 to m_8), the current train delay (m_9), the remaining train running time (m_10) and the train priority (m_11), specifically defined as follows (a small construction sketch in code is given after Table 1):
m_0 to m_3: together indicate the type of section in which the train is currently located. The sections making up the line are of three types in total; their schematic diagrams and the corresponding encoding of m_0 to m_3 are shown in Table 1.
m_4: taking the down direction of a straight section in Table 1 as an example, when the train is located on straight section A and section B is occupied by another train, m_4 = 1; otherwise m_4 = 0.
m_5: similar to m_4; taking the down direction of a straight section in Table 1 as an example, when the train is located on straight section A and section C is occupied by another train, m_5 = 1; otherwise m_5 = 0.
m_6: similar to m_4; taking the down direction of a straight section in Table 1 as an example, when the train is located on straight section A and section D is occupied by another train, m_6 = 1; otherwise m_6 = 0. The use of m_5 and m_6 enlarges the observation range of the train.
m_7: taking the down direction of a left-side-line section in Table 1 as an example, when the train is located on left-side-line section A and section B is occupied by another train, m_7 = 1; otherwise m_7 = 0.
m_8: taking the down direction of a right-side-line section in Table 1 as an example, when the train is located on right-side-line section A and section B is occupied by another train, m_8 = 1; otherwise m_8 = 0.
m_9: the current delay of the train, in minutes.
m_10: the remaining scheduled (punctual) running time of the train to the terminal, in minutes.
m_11: the train priority; the larger the value, the higher the priority. G-series trains without intermediate stops have priority 3, G-series trains with stops have priority 2, and D-series trains have priority 1.
TABLE 1 definition of section types in a line
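A minimal sketch, in Python, of how the 1 × 12 state vector could be assembled from the quantities defined above; the one-hot-style encoding of the three section types over m_0 to m_3 is an assumption for illustration (the actual encoding is the one given in Table 1), as are the field names:

```python
import numpy as np

# Assumed one-hot-style encoding of the three section types over m0..m3;
# the actual coding scheme is the one defined in Table 1 of the description.
SECTION_TYPE_CODES = {
    "straight": [1, 0, 0, 0],
    "left_side": [0, 1, 0, 0],
    "right_side": [0, 0, 1, 0],
}

def build_state_vector(section_type, occ_b, occ_c, occ_d, occ_left_b, occ_right_b,
                       delay_min, remaining_min, priority):
    """Assemble the 1x12 train state vector s_t^k = [m0, ..., m11]."""
    m0_3 = SECTION_TYPE_CODES[section_type]           # m0..m3: section type
    m4_8 = [int(occ_b), int(occ_c), int(occ_d),       # m4..m6: downstream occupancy
            int(occ_left_b), int(occ_right_b)]        # m7..m8: side-line occupancy
    m9_11 = [float(delay_min),                        # m9: current delay (minutes)
             float(remaining_min),                    # m10: remaining run time (minutes)
             float(priority)]                         # m11: train priority (1/2/3)
    return np.array(m0_3 + m4_8 + m9_11, dtype=np.float32).reshape(1, 12)

# Example: a non-stopping G train on a straight section, section B occupied, 5 min late
s = build_state_vector("straight", True, False, False, False, False, 5, 42, 3)
```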
Train action space definition:
The action space of the train contains four actions {A_0, A_1, A_2, A_3}; the correspondence between the different positions of the train and the executable actions is shown in Table 2.
Table 2 executable action table of train at different positions
The return function guides the final convergence direction of the DDDQN model; after each action selection, the train receives the return value given by the environment. The return function R_k is defined in equation (2) and consists of two parts: the return generated when the train reaches its terminal state, calculated as in equation (3), and the real-time return obtained after each decision, calculated as in equation (4).
In equation (3), d_k denotes the delay of the train when it arrives at the terminal station, i.e., the longer the delay, the larger the penalty. Since a train conflict (two trains entering the same section at the same time) may occur during DDDQN training, a variable c is set to indicate whether a conflict occurs: c = 0 means no conflict and c = 1 means a conflict. M_0 is a large positive integer (set to 5000 in the present invention) that imposes a large penalty when trains conflict.
In equation (4), d_c denotes the current delay of the train, T_k denotes the total number of decisions the train needs to make from origin to destination, and the value of a_t depends on the action selected at each decision: the return value decreases when the train chooses to stop and increases when the train chooses to run.
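The exact expressions of equations (3) and (4) are not written out here, so the following sketch only mirrors the qualitative behaviour described above; the functional forms, and the way a_t enters the real-time return, are assumptions:

```python
M0 = 5000  # large collision penalty mentioned in the text

def terminal_return(d_k, collision):
    """Return at the end state (spirit of equation (3)): the larger the final
    delay d_k, the larger the penalty; a conflict adds the penalty M0.
    The exact functional form is an assumption."""
    c = 1 if collision else 0
    return -d_k - c * M0

def step_return(action_is_stop, d_c, T_k, a_t=1.0):
    """Real-time return after each decision (spirit of equation (4)): it
    decreases when the train chooses to stop and increases when it runs.
    d_c is the current delay, T_k the total number of decisions; the weighting
    a_t and the exact form are assumptions."""
    sign = -1.0 if action_is_stop else 1.0
    return sign * a_t - d_c / max(T_k, 1)
```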
Step S2, building the Q-EvaluateNet structural model according to the historical operation data.
The fully connected neural network Q_E estimates and outputs the Q value of each action in different states. Q_E has four layers: an input layer, hidden layer 1, hidden layer 2 and an output layer; the network structure is shown in FIG. 3. The input layer has the same dimension as the state vector and contains 12 nodes in total for receiving the input state vector. Hidden layer 1 contains 32 neurons; it performs feature extraction and nonlinear transformation of the input data and passes the result to hidden layer 2. The computation of each neuron is shown in equation (5), y = f(ω_1·x_1 + ... + ω_m·x_m + b), where m denotes the number of values passed to this layer from the previous layer (i.e., the number of neurons in the previous layer), ω_k denotes the weight from the k-th output of the previous layer to the neuron, b denotes the bias term of the layer (each hidden layer has only one bias term), and f denotes the activation function, for which the invention adopts the ReLU function. Similarly, hidden layer 2, which contains 16 neurons, receives the output of hidden layer 1, performs another nonlinear transformation and passes the result to the output layer. The output layer produces 5 outputs, namely the advantage values of the four actions and the value of the current state; these two parts are then combined to finally obtain the Q values of the four actions, as shown in equation (6), where θ_t denotes the parameters of the neural network Q_E, including the weights ω_k and the bias terms b, V(s_t^k) denotes the value of the state s_t^k, A(s_t^k, a) denotes the advantage value of selecting action a in the current state, and Q(s_t^k, a_i) denotes the Q value of action i among the four output action values.
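A sketch of a network with this layer layout (12 inputs, 32 and 16 ReLU neurons, a state-value output and four action-advantage outputs) in PyTorch; the mean-subtracted recombination of value and advantage is the standard dueling form and is assumed here, since equation (6) is not written out:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Sketch of the Q-EvaluateNet described above: 12-dim input, two hidden
    layers (32 and 16 ReLU neurons), and an output split into a state value
    V(s) and four action advantages A(s, a), recombined into Q values."""

    def __init__(self, state_dim=12, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, 32), nn.ReLU(),   # hidden layer 1
            nn.Linear(32, 16), nn.ReLU(),          # hidden layer 2
        )
        self.value = nn.Linear(16, 1)              # V(s)
        self.advantage = nn.Linear(16, n_actions)  # A(s, a), one per action

    def forward(self, s):
        h = self.body(s)
        v = self.value(h)
        a = self.advantage(h)
        # assumed dueling recombination (mean-subtracted advantage)
        return v + a - a.mean(dim=1, keepdim=True)

q_eval = DuelingQNet()
q_values = q_eval(torch.zeros(1, 12))  # -> tensor of shape (1, 4)
```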
Step S3, building the Q-TargetNet structural model according to the historical operation data.
Q_T is used to generate the target Q value that guides the training and convergence of Q_E; its neural network structure is completely identical to that of Q_E, but its parameters are not updated at the same frequency. The target Q value is calculated as in equation (8), y_t^k = r_t^k + β·Q_T(s_{t+1}^k, a*; θ^-), where r_t^k denotes the return value obtained by train k at time t, β = 0.9 is the discount coefficient, and a* is the action with the highest Q value obtained by inputting the state vector s_{t+1}^k of the train at the next moment into Q_E, as calculated by equation (7). θ^- denotes the parameters of Q_T. Q_T(s_{t+1}^k, a*; θ^-) is the value assigned by Q_T to the action that has the highest Q value under Q_E in the next state; it represents an estimate of the long-term value of action a* and is also the target Q value of action a_t^k in state s_t^k.
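A sketch of this target computation; the handling of terminal states is an added assumption:

```python
import torch

def double_dqn_target(r, s_next, q_eval, q_target, beta=0.9, done=None):
    """Target value described above: the next-state action a* is chosen by
    Q-EvaluateNet (equation (7)) and evaluated by Q-TargetNet (equation (8)),
    discounted by beta = 0.9."""
    with torch.no_grad():
        a_star = q_eval(s_next).argmax(dim=1, keepdim=True)      # argmax from Q_E
        q_next = q_target(s_next).gather(1, a_star).squeeze(1)   # value from Q_T
        if done is not None:                                     # assumed terminal mask
            q_next = q_next * (1.0 - done)
        return r + beta * q_next
```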
Step S4, training the dueling double deep Q-network model formed by the Q-EvaluateNet structural model and the Q-TargetNet structural model to obtain an optimized DDDQN model.
When the number of experience tuples stored in the memory bank exceeds a certain value (256 in this design), training of the neural network starts. M = 128 experience tuples are randomly drawn from the memory bank; the state s_t^k of each tuple is input into Q_E in turn, and M outputs {Q_E1, Q_E2, ..., Q_EM} are obtained through the calculation of equation (6). The next state s_{t+1}^k of each tuple is input into Q_E and Q_T in turn, and M outputs {Q_T1, Q_T2, ..., Q_TM} are obtained through the calculation of equation (8). The loss value L(θ_t) of this training step is then obtained through the calculation of equation (9). Afterwards, Q_E uses the Adam optimizer to perform gradient descent to find the minimum loss value L(θ_t) and updates its parameters by back propagation, as shown in equation (10), where the learning rate is 0.0025 and θ_{t+1} denotes the parameters after the neural network completes one parameter update. The parameters of Q_T are not updated by back propagation of the loss value; instead, after every 10 parameter updates of Q_E, the parameters θ_t of Q_E are copied to Q_T to update it.
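A sketch of this training procedure, reusing the constants stated above (start after 256 stored tuples, M = 128, learning rate 0.0025, β = 0.9, copy every 10 updates); the MSE form of the loss in equation (9) and the replay-memory capacity are assumptions:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class DDDQNTrainer:
    """Replay training step as described above; q_eval / q_target are networks
    with a 12-dim input and 4 Q-value outputs (e.g. the DuelingQNet sketch)."""

    def __init__(self, q_eval, q_target, lr=0.0025, beta=0.9,
                 start_size=256, batch_size=128, sync_every=10):
        self.q_eval, self.q_target = q_eval, q_target
        self.memory = deque(maxlen=100_000)          # assumed replay capacity
        self.optim = torch.optim.Adam(q_eval.parameters(), lr=lr)
        self.beta, self.start_size, self.batch_size = beta, start_size, batch_size
        self.sync_every, self.updates = sync_every, 0

    def store(self, s, a, r, s_next):
        """Store one experience tuple (s_t, a_t, r_t, s_{t+1}); s has shape (1, 12)."""
        self.memory.append((s, a, r, s_next))

    def train_step(self):
        if len(self.memory) <= self.start_size:
            return None
        batch = random.sample(self.memory, self.batch_size)
        s = torch.as_tensor(np.concatenate([b[0] for b in batch]), dtype=torch.float32)
        a = torch.as_tensor([b[1] for b in batch], dtype=torch.int64).unsqueeze(1)
        r = torch.as_tensor([b[2] for b in batch], dtype=torch.float32)
        s_next = torch.as_tensor(np.concatenate([b[3] for b in batch]), dtype=torch.float32)
        with torch.no_grad():                        # double-DQN target, equations (7)-(8)
            a_star = self.q_eval(s_next).argmax(dim=1, keepdim=True)
            y = r + self.beta * self.q_target(s_next).gather(1, a_star).squeeze(1)
        q_sa = self.q_eval(s).gather(1, a).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y)       # assumed form of equation (9)
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        self.updates += 1
        if self.updates % self.sync_every == 0:      # copy Q_E -> Q_T every 10 updates
            self.q_target.load_state_dict(self.q_eval.state_dict())
        return loss.item()
```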
Step S5, inputting the current state information of the train into the optimized DDDQN model to obtain the optimal action for the train to execute in its current state, and performing dispatching adjustment of the train according to this optimal action so as to shorten the delay time.
The state vector of the train is input into the trained DDDQN model to obtain the optimal selectable action of the train in its current state. This action can serve as an aid for the dispatcher in issuing dispatching instructions and can guide the dispatcher in arranging train routes within stations. Meanwhile, the train's selection of an action is essentially a selection of the times at which the train enters the individual track sections, so the arrival time of the train at each station and its running time between stations can be further derived, finally yielding an adjusted train timetable that supports train operation adjustment.
The technical scheme of the invention proposes a value-function-based dueling double DQN (DDDQN) model and, on this basis, adjusts the training mode and the application scenes of the model to realize multi-task deep reinforcement learning. The overall structure of the DDDQN model is shown in FIG. 2, and the learning process is as follows. A simulation environment is created in which an agent (i.e., a train) inputs its state into the value function fitted by the neural network to obtain the value (Q value) of each action. The train then selects the action with the highest Q value, executes it, enters the next state, and receives the return value fed back by the environment. The train's current state vector s_t^k, the executed action a_t^k, the obtained return value r_t^k and the next state vector s_{t+1}^k are regarded as one piece of experience data (s_t^k, a_t^k, r_t^k, s_{t+1}^k) and stored in the memory bank, and the neural network is trained and updated on this experience data so that it obtains increasingly accurate estimates of the value of each action in each state. The above process is repeated until the set number of training rounds is reached. After training is completed, inputting the train state into the model yields an accurate value for each action and thus an optimal scheduling decision.
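A sketch of one episode of this interaction loop; the `env` object with reset()/step() is an assumed simulation wrapper rather than anything specified in the text, and exploration (if any) is omitted:

```python
import torch

def run_episode(env, trainer, q_eval):
    """One learning episode of the DDDQN loop described above: pick the action
    with the highest Q value, execute it, store (s_t, a_t, r_t, s_{t+1}) in the
    memory bank and update the network."""
    s = env.reset()                          # initial 1x12 state vector
    done = False
    while not done:
        with torch.no_grad():
            q = q_eval(torch.as_tensor(s, dtype=torch.float32))
            a = int(q.argmax())              # action with the highest Q value
        s_next, r, done = env.step(a)        # execute action, observe return
        trainer.store(s, a, r, s_next)       # memorise the experience tuple
        trainer.train_step()
        s = s_next
```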
Based on the DDDQN deep reinforcement learning model constructed above, the following multitask model training is started, including:
Step 1: divide and select T different training scenes (T = 6).
The method provided by this embodiment mainly solves the train operation adjustment problem under emergency scenes (hereinafter referred to as scenes), so the scenes need to be defined. Any scene m is defined by a triplet (s_m, t_m, p_m), whose three elements respectively denote the line interruption start time, the interruption duration and the line interruption position of scene m.
Defining and selecting a scene therefore only requires specifying the three parameters: the line interruption start time s_m, the interruption duration t_m and the line interruption position p_m. In the setting of the invention, with the Beijing-Zhangjiakou high-speed railway as the background, the interruption start time increases from 10:00 to 19:00 in steps of one hour, giving 10 alternative interruption start times. The interruption duration increases from 30 minutes to 150 minutes in steps of 30 minutes, giving 5 alternative interruption durations. The inter-station sections from Qinghe Station to Xuanhua North Station are selected, giving 6 alternative inter-station sections. A scene can be constructed from any combination of all the optional elements above (although different training scenes lead to different final training effects).
Based on the numbers of optional elements in the three sets, 10 × 5 × 6 = 300 scenes can be composed. For convenience of selection, a mathematical optimization model is constructed for scene selection and can be used as a reference. When training the model, scenes with large differences should be selected as far as possible to improve the training effect. Therefore, a quantitative measure of the difference between different scenes is defined, and an integer programming (IP) model is constructed and solved to obtain the N scenes with the largest degree of difference for training. The objective function of the model is shown in equation (11); the optimization objective is to maximize the degree of difference between the selected scenes, where x_m is a 0-1 variable indicating whether scene m is selected and equation (12) requires that N scenes be selected in total. c_{m,n} denotes the degree of difference between scene m and scene n and is calculated as in equation (13), which uses the number of trains passing the interruption position p_m of scene m at time s_m. The number of trains passing each section from 8:00 to 21:00 is shown in Table 3. A code sketch of the scene enumeration and selection follows Table 5.
Table 3 Number of trains passing each section from 10:00 to 19:00
After solving the IP model, the N = 6 scenes with the largest degree of difference are selected as training scenes; the specific settings of each scene are shown in Table 4, and the Chinese names of the line interruption positions are listed in Table 5.
Table 4 training scenario set-up table
TABLE 5 Chinese name lookup table for line break locations
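A sketch of the scene enumeration and of a greedy stand-in for the scene-selection IP of equations (11)-(13); the section labels and the externally supplied difference function `diff` are assumptions:

```python
import itertools

def build_candidate_scenes():
    """Enumerate the 10 x 5 x 6 = 300 candidate scenes (s_m, t_m, p_m) described
    above: start times 10:00-19:00 hourly, durations 30-150 min in 30-min steps,
    and 6 inter-station sections (placeholder labels)."""
    starts = list(range(10, 20))                       # 10:00 .. 19:00
    durations = list(range(30, 151, 30))               # 30 .. 150 minutes
    positions = [f"section_{i}" for i in range(1, 7)]  # 6 candidate sections
    return list(itertools.product(starts, durations, positions))

def greedy_select(scenes, diff, n=6):
    """Greedy stand-in for the 0-1 IP of equations (11)-(13): repeatedly add the
    scene that maximises the summed pairwise difference degree diff(m, n) with
    the already-selected set. The exact difference measure of equation (13) is
    supplied by the caller."""
    chosen = [scenes[0]]
    while len(chosen) < n:
        best = max((s for s in scenes if s not in chosen),
                   key=lambda s: sum(diff(s, c) for c in chosen))
        chosen.append(best)
    return chosen
```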
Step 2: different scenarios are trained in parallel on distributed computers.
The T different training scenes are deployed on T computers, and N rounds of independent training are performed.
Step 3: calculation I pro Value according to I pro The value selection continues the parallel training mode or switches to the multitasking training mode. After N rounds of independent training are finished, calculating I according to the number of rounds of current training, the loss value of each scene and the return value of each scene pro Values.
I pro Definition of values: in the training process, in order to improve the stability of model training and avoid negative influence of each model during simultaneous training, an alternating mode of independent parallel training and multi-task training of each task is adopted, more modes of independent training of each task are adopted to control the model to converge as much as possible when the overall training of the model is unstable, and after the model is stable, the information transfer among the models is completed by adopting the multi-task training mode. I pro The definition of the index for measuring the model learning progress is shown in the formula (14). Setting a random number [ mu ] E [0,1 ]]To control the selection of model training patterns, when I pro If the training is larger than mu, multitasking training is adopted, otherwise parallel training is adopted. I.e. I pro The larger the value, the more beneficial the way in which the multitasking is used, and the more probable the way in which the multitasking is chosen.
In equation (14), the learning progress is described by three dimensions, corresponding to i 1 、i 2 、i 3 The training round number progress, the current loss value size and the current return value size are respectively. Wherein i is 1 The training progress of the model is described from the point of view of the number of training rounds as shown in formula (15). Where n represents the number of current training rounds and a represents the set total training rounds. i.e 1 The value of (2) decreases with the increasing number of training rounds to encourage the model to learn as much as possible in the initial stage of training, making training more difficultThe available prior knowledge is obtained from the easier training tasks as early as possible to speed up the training progress.
i 2 The training progress of the model is described from the point of view of the loss value, as shown in equation (16). Wherein b is a coefficient,the normalized loss value of the task t in the current training round is represented by the calculation mode shown in the formula (17). />Representing the average loss value of task t from training start to current round,/>The standard deviation of the loss value of task t from the start of training to the current round is represented.The smaller the value, i 2 The larger the value is, the more stable the training of each task is, and the multi-task learning can be performed.
i 3 The training progress of the model is described in terms of the return value, as shown in equation (18). Where T represents the total number of tasks to be trained. r is (r) out Indicating that the return value is not within the range R in the current training round cur -c·R std ,R cur +c·R std ]The number of tasks within. Wherein c is a coefficient, R cur Representing the average value of the return values of all tasks under the current round, R std Representing the standard deviation of the return values for all tasks in the current round. i.e 3 The larger the value, the more scattered the return value of each task, and the larger the difference between the tasks, in this case, the more the experience migration between different tasks needs to be performed by adopting multi-task learning.
If I pro Less than μ, perform parallel training: and (5) repeating the step 2 and the step 3. If I pro And (5) performing multitasking training, wherein mu is larger than or equal to mu.
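A sketch of the mode switch and of the i_3 dispersion term; expressing i_3 as the fraction r_out / T is an assumption, and the way i_1, i_2 and i_3 combine into I_pro in equation (14) is left out:

```python
import random
import statistics

def i3_return_dispersion(task_returns, c=1.0):
    """i_3 term in the spirit of equation (18): the share of tasks whose current
    return lies outside [R_cur - c*R_std, R_cur + c*R_std], where R_cur / R_std
    are the mean / standard deviation of all task returns in the current round."""
    r_cur = statistics.mean(task_returns)
    r_std = statistics.pstdev(task_returns)
    r_out = sum(1 for r in task_returns
                if not (r_cur - c * r_std <= r <= r_cur + c * r_std))
    return r_out / len(task_returns)

def choose_training_mode(i_pro):
    """Mode switch described above: draw mu ~ U[0, 1]; multi-task training is
    used when I_pro > mu, otherwise the tasks keep training in parallel."""
    mu = random.random()
    return "multitask" if i_pro > mu else "parallel"
```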
Step 4: the current composite loss function value is calculated.
The composite loss function is shown in equation (19), where λ_0 to λ_4 are hyperparameters. For convenience of explanation, the first term of equation (19) is hereinafter abbreviated as the λ_0 term, the second term as the λ_1 term, and so on.
When the multi-task deep reinforcement learning model is trained, each scene is one subtask. The model is trained with T different subtasks; for each subtask t ∈ {1, 2, ..., T}, the parameters θ_t of the neural network used to fit its value function can be represented by the neural networks of the same structure that solve the other subtasks, in the form shown in equation (20). Here B is an asymmetric matrix of size T × T that stores the weights of the information transferred between tasks; B_st is the weight with which the model parameters of task s are used to represent the model parameters of task t. B_tt, i.e., the transfer weight of a task's own network parameters, is always set to 1.
In order to train a multi-task deep reinforcement learning model that can adapt to various scenes, besides minimizing the loss between the target Q value and the estimated Q value, a regularization term is added to keep the model parameters as simple as possible and prevent overfitting. The loss function containing these two parts is shown in equation (21); its two terms are the loss value term of the model and the L2 regularization term, and μ and λ are coefficients that adjust the importance of the two parts. W = {θ_1, ..., θ_T} denotes the set of network parameters of all tasks, and b_t denotes the weight vector of the information transferred from task t to the other tasks; penalizing its L1 norm satisfies the sparsity requirement on the transfer matrix B and speeds up computation. L(θ_t) denotes the loss value generated by the model of task t at each iterative update. It follows that solving this problem requires obtaining the minimum loss value of equation (21) by updating the model parameters W and the transfer matrix B of all subtasks.
However, when there are more training scenes, the higher dimension of the matrix W is not conducive to calculation, and the loss value of each task fluctuates continuously during training, which may cause negative transfer of information between tasks (i.e., information transfer between different models makes model performance worse or updates the models in wrong directions). To avoid these problems, the idea of curriculum learning is introduced, so that the model dynamically adjusts the training order of the different tasks during training and gradually transitions from tasks that are easy to train to more complex tasks.
Let S be the set of all possible orderings of the T tasks, let the element π ∈ S be one possible task training order in the set, and let π(i) denote the i-th task to be trained in the order π. The optimization objective that takes the task training order into account on the basis of equation (21) is converted into equation (22). In the original definition of equation (20), the parameters of one network need to be represented by the parameters of all the other networks; after curriculum learning provides an optimal training order, the model parameters of an untrained task are related only to the parameters of the models trained before it, which avoids the dimension disaster that could be caused by directly updating the B matrix.
Specifically, the training order of the models is determined by the magnitude of the loss value of each task in each training round. Let τ = {π(1), ..., π(i-1)} denote the set of tasks that have already been trained and U = {1, ..., T} \ τ denote the set of models not yet trained in this round; the next task t ∈ U to be trained is then determined in the manner shown in equation (23). This corresponds to the λ_0 and λ_1 terms of the composite loss function.
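A sketch of this curriculum-style selection of the next task; using the plain per-task loss as the criterion of equation (23) is an assumption:

```python
def next_task(untrained, loss_of):
    """Pick the next task to train among the not-yet-trained set U: the one
    with the smallest current loss, so training moves from easier tasks to
    harder ones."""
    return min(untrained, key=loss_of)

# Example usage with hypothetical per-task losses
losses = {1: 0.42, 2: 0.17, 3: 0.88}
order = []
remaining = set(losses)
while remaining:
    t = next_task(remaining, losses.get)
    order.append(t)
    remaining.remove(t)
# order == [2, 1, 3]
```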
In the λ_2 term, let r_{t,i} denote the return value obtained when the model trained on task t is applied to task i. Suppose that during training the return obtained on task i is larger than that obtained on task j; then, in the transfer matrix B, the corresponding entries of the transfer vector of task t should satisfy the same relationship, i.e., the transfer weight from task t to task i should be larger than that from task t to task j. To meet this requirement, the λ_2 term of equation (19) compares the ranking of each return value r_{t,i} among the return values with the ranking of the corresponding transfer weight; the objective of the optimization is to keep the two ranking orders the same, i.e., to minimize the difference between them.
Similarly to the λ_2 term, in the λ_3 term the ranking of the loss value L(θ_j) of task j in the loss value set {L(θ_1), ..., L(θ_{t-1}), L(θ_{t+1}), ..., L(θ_T)} is considered, and the ranking of the corresponding transfer weight should likewise be kept the same as it.
If task t and task i are relatively similar, the transfer information weight from task t to task i should be as large as possible. The similarity of the two tasks can be measured by the similarity of their neural network parameter models and is obtained by calculating the cosine similarity of the output layer parameters of the two neural networks, as shown in equation (24). Let s_{i,t} be the similarity of task i and task t, and let x_i and x_t denote the output layer parameters of the neural networks handling the different tasks. In the λ_4 term, the ranking of s_{i,j} in the set of similarities between task i and all the other tasks is considered.
The weight parameters of the terms in equation (19) are adaptive; they are dynamically adjusted according to the training progress of the model, as shown in equation (25), where ε = 0.01 avoids a zero denominator and σ_i denotes the loss value of the i-th term of equation (19) in the previous training round. Terms with larger loss values in the composite loss function have their proportion reduced appropriately.
Step 5: and calculating the current composite loss function values of all the subtasks, and selecting the task t with the smallest composite loss function value as the next task for parameter updating.
Step 6: updating transfer information vector of task t to other tasks by using Frankwolfe algorithm
The training objective of the multitasking model is to minimize the composite loss function value, so the minimum loss value is found in step 6To complete->Is updated according to the update of the update program. Is provided with->Expressed as equation (26), the minimum problem is solved for a nonlinear objective function, the objective function of which is shown as equation (27), and the constraint is shown as equation (28).
/>
Frank-Wolfe (FW) step 1: in this problem the objective function is nonlinear, so the objective function is linearized first. A first-order Taylor expansion is performed at the current feasible point and the constant terms are eliminated, converting the original problem into a linear problem in the transfer vector.
(FW) step 2: find the descent direction. Let the optimal solution of the linearized problem be obtained; then equation (29) holds, and the difference between this solution and the current feasible point is the descent direction at that point.
(FW) step 3: find the descent step size. Starting from the current point, a one-dimensional search along the descent direction is performed to solve for the update step size t. After the step size t is obtained, the current point is moved by a step of size t along the descent direction; because this is a descent direction, the objective value does not increase.
(FW) step 4: iterative update. After the new point is obtained, the above process is repeated until the set number of iterations is reached (set to 20 in this embodiment).
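A sketch of the four FW steps for updating the transfer vector b_t; the feasible region is assumed to be the probability simplex (constraint (28) is not written out), and a backtracking search stands in for the one-dimensional line search of step 3:

```python
import numpy as np

def frank_wolfe(grad_f, f, b0, iters=20):
    """Frank-Wolfe iteration: linearise the objective at the current point,
    take the best simplex vertex as the linearised minimiser, line-search the
    step size, and repeat for `iters` iterations (20 in this embodiment)."""
    b = np.asarray(b0, dtype=float)
    for _ in range(iters):
        g = grad_f(b)
        vertex = np.zeros_like(b)
        vertex[np.argmin(g)] = 1.0            # minimiser of the linearised problem
        d = vertex - b                        # descent direction (step 2)
        # backtracking search for the step size t in [0, 1] (step 3)
        t, best_t, best_val = 1.0, 0.0, f(b)
        while t > 1e-4:
            val = f(b + t * d)
            if val < best_val:
                best_t, best_val = t, val
            t *= 0.5
        b = b + best_t * d                    # step 4: move to the new point
    return b

# toy usage: minimise ||b - target||^2 over the simplex
target = np.array([0.2, 0.5, 0.3])
b_opt = frank_wolfe(lambda b: 2 * (b - target), lambda b: np.sum((b - target) ** 2),
                    np.array([1.0, 0.0, 0.0]))
```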
Step 7: updating neural network parameter θ for task t t
Obtaining a neural network parameter theta corresponding to the minimum loss value by using an Adam optimizer t Back propagation back to neural network completes θ t Is updated according to the update of the update program.
The above steps are looped until training is completed, as in fig. 4. The whole training process can be regarded as two stages, wherein the first stage is that each scene is independently trained, and steps 2-3 circulate N rounds. The second stage is to perform independent training or multitasking training (training mode is according to I) pro The value is dynamically adjusted) and the second stage loops from step 4 to step 7 or from step 2 to step 3 (as two possible training modes are used). After one round of circulation is finished, judging the circulation mode of the next round) M rounds can be finished. Where n+m=p, P is the number of rounds the model is trained for. The set training round number P can be ended, and the round number is manually set, and generally has no clear principle and is an empirical value, and the set training round number p=200, n=20 and m=180 in the experiment of this embodiment.
The pseudo code of the multi-task deep reinforcement learning is as follows:
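The original pseudo code figure is not reproduced in this text. The following Python-style sketch is consistent with steps 1–7 and the two-stage schedule described above; the task objects, their methods, and the mode-selection criterion are hypothetical names, and frank_wolfe refers to the sketch given earlier:

```python
import numpy as np

def train_multitask(tasks, n_indep_rounds=20, n_total_rounds=200, fw_iters=20):
    """Two-stage training sketch: independent pre-training, then adaptive
    independent / multi-task training, following steps 1-7 described above."""
    for rnd in range(n_total_rounds):
        if rnd < n_indep_rounds:
            # Stage 1: each scenario/task is trained independently (steps 2-3)
            for task in tasks:
                task.train_independently()
        else:
            # Stage 2: choose the training mode for this round (per I_pro)
            if use_independent_mode(tasks):           # hypothetical mode-selection criterion
                for task in tasks:
                    task.train_independently()
            else:
                # steps 4-5: pick the task with the smallest composite loss
                losses = [task.composite_loss() for task in tasks]
                t = int(np.argmin(losses))
                # step 6: update the transfer information vector via Frank-Wolfe
                tasks[t].alpha = frank_wolfe(tasks[t].loss_grad,
                                             tasks[t].alpha, n_iters=fw_iters)
                # step 7: update the task-t network parameters with Adam
                tasks[t].update_parameters()
    return tasks
```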
To verify the performance of the proposed multi-task model and apply it to practical problems, an integrated high-speed train group regulation and control simulation software was designed. The software can simulate the running process of trains, set up different train operation emergency scenarios, and finally generate the adjusted train timetable.
In the simulation software, a line is divided into a number of connected track sections, and the running of a train on the line is the process of successively occupying these track sections. First, the multi-task model (i.e., the neural network Q_E) obtained by training according to the method above is embedded in the simulation software. The parameters of the simulation scenario, including the interruption location and the start and end times of the interruption, are then determined. During the simulation, the current state vector of every train is obtained every minute and input into Q_E; each train then obtains the Q value of every action and executes the action with the highest Q value, and the simulation platform records the time at which each train enters each track section. This process is repeated until all trains reach the terminal station, which yields the adjusted train timetable.
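A minimal sketch of this simulation loop, assuming a hypothetical simulator object exposing per-train state vectors and a trained Q-network q_e (the names are illustrative, not the software's actual API):

```python
import numpy as np

def run_simulation(sim, q_e, step_minutes=1):
    """Roll out the trained policy inside the simulator until all trains finish."""
    entry_log = []                                    # (train, section, time) records
    while not sim.all_trains_arrived():
        for train in sim.active_trains():
            state = sim.state_vector(train)           # current state vector of this train
            q_values = q_e(state)                     # Q value of each candidate action
            action = int(np.argmax(q_values))         # greedy action with the highest Q value
            sim.apply_action(train, action)
        entry_log.extend(sim.advance(step_minutes))   # advance and collect section-entry records
    return sim.timetable(), entry_log                 # adjusted train schedule plus the log
```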
In actual train dispatching and command, when an emergency requires adjustment of train operation, the positions of all trains and the occupation status of each track section can be obtained from the actual-performance operation information of the trains. This information is assembled into a state vector and input into the trained multi-task model, which yields an operation adjustment scheme for the trains (the adjusted arrival times, dwell times and inter-station running times of each train at each station, together with the train routing information), thereby assisting the dispatcher in decision-making and adjustment.
This embodiment is described in detail below, taking six scenarios as examples:
To verify the effectiveness of the proposed multi-task deep reinforcement learning model, the weighted train delay time T_r obtained by a greedy algorithm, by the DDDQN model and by the multi-task deep reinforcement learning model is compared in 10 test scenarios that differ from the training scenarios; the comparison results are shown in Table 6. DDDQN-B and DDDQN-C denote the DDDQN models trained under scenario B and scenario C respectively, whose weighted train delay times are obtained in each test scenario.
The weighted delay time T_r simultaneously accounts for the train delay time and for the influence of train conflicts; its calculation is shown in formula (30), where d_k denotes the delay time of train k, c = 1 indicates that the train is involved in a conflict (otherwise no conflict occurs), and t_{s,d} denotes the duration of the interruption. That is, if a train conflict occurs, the delay time of that train is directly recorded as the duration of the interruption.
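A form of the weighted delay time consistent with this description (the exact layout of formula (30) is not reproduced here; the conflict indicator is written per train as c_k for clarity) is:

```latex
T_r = \sum_{k} \Bigl[ (1 - c_k)\, d_k + c_k\, t_{s,d} \Bigr]
```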
As can be seen from Table 6, in most test scenarios the weighted train delay time obtained by the multi-task model is lower than that of the other three models. DDDQN-B and DDDQN-C are models trained independently in a single scenario; when they are applied to other scenarios, train conflicts occur very easily, and their weighted delay times are clearly higher than those of the multi-task model and the greedy algorithm. Train conflicts are also a problem that must be avoided as far as possible in actual train dispatching and command. This shows that a model trained in a single scenario cannot solve the train operation adjustment problem of other scenarios well. The multi-task deep reinforcement learning model, by contrast, is highly robust: it maintains a good optimization effect in scenarios other than the training scenarios, effectively shortens the train delay time, and avoids train conflicts.
TABLE 6 Weighted train delay time (in minutes) obtained by each model in the test scenarios
Further, the scenario settings are modified to consider more complex test scenarios. In addition to a line interruption at a certain location, each test scenario randomly sets 20% of the trains to depart from the origin station, or from a stop, 1 to 10 minutes (varying amounts) later than the planned time, so as to simulate the slight disturbances that frequently occur in actual operation. Since the disturbance length and the affected trains are not fixed, 300 tests were performed in each scenario and the average of the weighted delay times is taken as the result; the test results are shown in Table 7. The results obtained in the scenarios with disturbances are similar to those in Table 6: the model trained in a single scenario performs poorly in the disturbed scenarios and train conflicts occur easily, whereas the multi-task model still maintains a good effect and the weighted train delay time is clearly shortened.
TABLE 7 Average weighted train delay time obtained by each model under slight disturbances
Embodiment 3
To execute the method of Embodiment 1 and achieve the corresponding functions and technical effects, an intelligent train dispatching optimization system is provided as follows, comprising:
An actual performance operation data acquisition module, used for acquiring actual performance operation data of the train to be tested at the current moment.
A scheduling strategy prediction module, used for inputting the actual performance operation data into the multi-task deep reinforcement learning model to obtain the scheduling strategy of the train to be tested at the next moment; the multi-task deep reinforcement learning model is obtained by training a dueling double deep Q-network (DDDQN) model with historical operation data of trains in multiple scenarios.
A train control module, used for controlling the running of the train to be tested according to the scheduling strategy of the train to be tested at the next moment.
Embodiment 4
This embodiment provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program and the processor is configured to run the computer program to cause the electronic device to execute the intelligent train dispatching optimization method of Embodiment 1.
Wherein the memory is a readable storage medium.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples; this description is intended only to assist in understanding the method of the present invention and its core ideas. Moreover, modifications made by those of ordinary skill in the art in light of the present teachings remain within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. An intelligent train dispatching optimization method, characterized by comprising the following steps:
acquiring actual performance operation data of a train to be tested at the current moment;
inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment; the multi-task deep reinforcement learning model is obtained by training a dueling double deep Q-network (DDDQN) model by utilizing historical operation data of trains in multiple scenes;
and controlling the running of the train to be tested according to the scheduling strategy of the next moment of the train to be tested.
2. The intelligent train dispatching optimization method according to claim 1, further comprising, before acquiring the actual performance operation data of the train to be tested at the current moment:
determining historical operation data of trains in a plurality of training scenes;
constructing a plurality of deep reinforcement learning models; the deep reinforcement learning models correspond to the training scenes one by one; each deep reinforcement learning model comprises a Q-EvaluateNet structural model and a Q-TargetNet structural model;
determining the plurality of deep reinforcement learning models as the deep reinforcement learning models at the 0-th iteration;
let the first iteration number i=1;
respectively carrying out parallel training on the deep reinforcement learning models at the (i-1)-th iteration by utilizing the historical operation data of the trains in the plurality of training scenes until training rounds of the plurality of deep reinforcement learning models reach a training round threshold, so as to obtain the deep reinforcement learning model at the i-th iteration;
judging whether the first iteration number reaches a first iteration number threshold value or not to obtain a first judgment result;
if the first judgment result is negative, calculating a return value of the i-th iteration;
judging whether the return value of the i-th iteration is greater than a return value threshold value or not, and obtaining a second judgment result;
if the second judgment result is negative, increasing the value of the first iteration number i by 1, and returning to the step of respectively carrying out parallel training on the deep reinforcement learning models at the (i-1)-th iteration by utilizing the historical operation data of the trains in the plurality of training scenes until the training rounds of the plurality of deep reinforcement learning models reach the training round threshold, so as to obtain the deep reinforcement learning model at the i-th iteration;
if the second judgment result is yes, performing multi-task training on the deep reinforcement learning model at the (i-1)-th iteration by using the historical operation data of the train in the plurality of training scenes to obtain the deep reinforcement learning model at the i-th iteration;
and if the first judgment result is yes, determining the deep reinforcement learning model at the i-th iteration as a multi-task deep reinforcement learning model.
3. The intelligent train dispatching optimization method according to claim 2, further comprising, before determining the historical operation data of the trains in the plurality of training scenes:
determining any one scene of the train as a current scene;
respectively determining a degree of difference between each scene of the train other than the current scene and the current scene;
traversing all scenes of the train to obtain a plurality of degrees of difference;
after arranging the plurality of degrees of difference in descending order, determining the scenes corresponding to a preset number of the degrees of difference as a pending scene set;
and performing de-duplication processing on the scenes in the pending scene set to obtain the plurality of training scenes.
4. The intelligent train dispatching optimization method according to claim 2, wherein the respectively carrying out parallel training on the deep reinforcement learning models at the (i-1)-th iteration by utilizing the historical operation data of the trains in the plurality of training scenes until the training rounds of the plurality of deep reinforcement learning models reach the training round threshold, so as to obtain the deep reinforcement learning model at the i-th iteration, comprises:
determining any training scene as a current training scene;
initializing parameters of a Q-EvaluateNet structural model corresponding to the current training scene;
initializing parameters of a Q-TargetNet structural model corresponding to the current training scene;
constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the train;
inputting a plurality of train state vectors at the current historical moment into the Q-EvaluateNet structural model to obtain a plurality of first action vectors;
executing corresponding action vectors based on the plurality of train state vectors at the current historical moment to obtain a plurality of train state vectors at the next historical moment;
inputting a plurality of train state vectors at the next historical moment into the Q-EvaluateNet structural model to obtain a plurality of second action vectors;
inputting a plurality of train state vectors at the next historical moment into a Q-TargetNet structural model to obtain a plurality of target Q values;
determining a loss function value according to the plurality of second action vectors and the plurality of target Q values;
adjusting the parameters of the Q-EvaluateNet structural model by using a gradient descent method according to the loss function value, updating the current historical moment, and returning to the step of constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the train, until the number of parameter adjustments of the Q-EvaluateNet structural model reaches a first parameter adjustment times threshold;
copying the parameters of the Q-EvaluateNet structural model to the Q-TargetNet structural model, updating the current historical moment, and returning to the step of constructing a plurality of train state vectors at the current historical moment according to the historical operation data of the train, until the number of parameter adjustments of the Q-TargetNet structural model reaches a second parameter adjustment times threshold;
and determining the trained Q-EvaluateNet structural model as the deep reinforcement learning model at the i-th iteration for the current training scene.
5. The intelligent train dispatching optimization method according to claim 2, wherein the performing multi-task training on the deep reinforcement learning model at the (i-1)-th iteration by using the historical operation data of the train in the plurality of training scenes to obtain the deep reinforcement learning model at the i-th iteration comprises the following steps:
defining a task set; the tasks in the task set are in one-to-one correspondence with the training scenes;
determining an optimal training order of the plurality of tasks in the task set by using a curriculum algorithm, and constructing a task sequence according to the optimal training order;
determining a first task in a task sequence as a current task;
let the second iteration number m=1;
determining a training scene corresponding to the current task as a current training scene;
training a deep reinforcement learning model corresponding to the current task by utilizing the historical operation data of the train in the current training scene;
updating parameters of a plurality of deep reinforcement learning models according to the deep reinforcement learning model corresponding to the current task;
determining a composite loss function of a plurality of tasks;
determining the task with the smallest composite loss function as the current task, updating parameters of the deep reinforcement learning model of the current task, updating transfer information vectors of the current task to a plurality of non-current tasks by using a Frank-Wolfe algorithm, increasing the value of the second iteration number m by 1, and returning to the step of determining the training scene corresponding to the current task as the current training scene, until the second iteration number reaches a second iteration number threshold, thereby obtaining the deep reinforcement learning model at the i-th iteration.
6. The intelligent train dispatching optimization method according to claim 5, further comprising, after updating the transfer information vectors of the current task to the plurality of non-current tasks by using the Frank-Wolfe algorithm:
and updating the corresponding deep reinforcement learning model according to the transfer information vector.
7. An intelligent train dispatching optimization system, characterized by comprising:
the actual score operation data acquisition module is used for acquiring actual score operation data of the train to be tested at the current moment;
The scheduling strategy prediction module is used for inputting the actual performance operation data into a multi-task deep reinforcement learning model to obtain a scheduling strategy of the train to be tested at the next moment; the multi-task deep reinforcement learning model is obtained by training a dual-pair neural network model by utilizing historical operation data of trains in multiple scenes;
and the train control module is used for controlling the running of the train to be tested according to the scheduling strategy of the next moment of the train to be tested.
8. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the intelligent train dispatching optimization method according to any one of claims 1 to 6.
9. The electronic device of claim 8, wherein the memory is a readable storage medium.
CN202310467034.0A 2023-04-27 2023-04-27 Intelligent train dispatching optimization method and system and electronic equipment Pending CN116562553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310467034.0A CN116562553A (en) 2023-04-27 2023-04-27 Intelligent train dispatching optimization method and system and electronic equipment


Publications (1)

Publication Number Publication Date
CN116562553A true CN116562553A (en) 2023-08-08

Family

ID=87487191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310467034.0A Pending CN116562553A (en) 2023-04-27 2023-04-27 Intelligent train dispatching optimization method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN116562553A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination