CN115170006B

CN115170006B - Dispatching method, device, equipment and storage medium

Info

Publication number: CN115170006B
Application number: CN202211095230.1A
Authority: CN
Inventors: 宋轩; 朱世博; 冯德帆; 陈星宇; 朱佳文
Original assignee: Southwest University of Science and Technology
Current assignee: Southwest University of Science and Technology
Priority date: 2022-09-08
Filing date: 2022-09-08
Publication date: 2022-11-29
Anticipated expiration: 2042-09-08
Also published as: CN115170006A

Abstract

The invention discloses a departure scheduling method, device, equipment and storage medium. The method comprises the following steps: carrying out passenger flow simulation for a preset total number of simulations according to passenger travel data, acquiring sample data during the simulation and storing the sample data in a memory bank, and, after each simulation is finished, obtaining the number of fast departure time periods, the total waiting time and the current value network corresponding to that simulation; when the number of sample data in the memory bank reaches a preset number threshold, randomly selecting a batch of sample data from the memory bank according to a preset batch size and training the latest current value network to obtain an updated latest current value network; when the number of simulations reaches the preset total number of simulations, determining the optimal current value network corresponding to each number of fast departure time periods; and determining the action of the next departure time period from the state data of a departure time period through the optimal current value network corresponding to a selected number of fast departure time periods. The invention can dynamically adjust the departure mode in real time.

Description

Departure scheduling method, device, equipment and storage medium

Technical Field

The present invention relates to the field of vehicle dispatching technologies, and in particular to a departure scheduling method, device, equipment and storage medium.

Background

A common approach in the scheduling field is to construct a mathematical model that represents the complex variation in subway passenger numbers and then search for the optimal schedule. Sun L et al. constructed three mathematical models for different conditions and solved for the optimal subway schedule under each: scheduling without a constraint on the number of departures, scheduling with departure-number constraints only for peak/off-peak periods, and all-day scheduling with departure-number constraints. Yang X et al. carried out multi-objective optimization for a subway station case, with the objectives of maximizing passenger convenience and minimizing the cost of train departures, and obtained a series of optimal solutions by solving for the Pareto-optimal curve. Kang L et al. constructed a Mixed Integer Linear Programming (MILP) model to coordinate the departure times of the last trains on multiple routes. The model is divided into two small-scale MILP models, which are solved with the WebSphere ILOG CPLEX solver.

In addition, heuristic algorithms are also used for subway scheduling; compared with constructing a mathematical model, these algorithms can reach a good solution more quickly. Kuppusamy P et al. combined Long Short-Term Memory (LSTM) with an Improved Genetic Algorithm to optimize the schedule of a single subway station and improve the disturbance resistance of the subway system. Yang S et al. used the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize passengers' total travel time, including queuing time and riding time.

However, the above schemes have the following disadvantages:

1. Most mathematical solutions face one problem: the high departure frequency of the subway makes the subway model too complex. If the scheduling scheme of the whole subway system is modeled and solved mathematically, the model becomes difficult to solve because the complexity (especially in the time dimension) produces too many variables. Common mathematical methods can therefore consider only part of the system, for example optimizing only a few stations, or only the transfer or last-train case. As a result, this class of solutions neglects many real-world details and may perform poorly in practical applications.

2. The accuracy of data recording is low. In the real world, as mobile payment spreads, entering the station by scanning a code is becoming common, and it is difficult for subway operators to know a passenger's destination at the moment the passenger enters. Some traditional models require entry and exit information accurate to the individual passenger, so these algorithms cannot adapt well to the current trend of mobile payment.

3. Traditional learning schemes, such as departure timetables produced by a genetic algorithm, are severely limited in practical applications. The algorithm can compute the optimal departure timetable for the current day only after the complete passenger flow of that day is known. By then the day's operation has already ended, so the resulting timetable is of little use for that day. The timetable obtained by a genetic algorithm is static: if it is trained on the data of day T and used to guide the departure frequency of day T+1, it cannot readily handle the passenger flow characteristics that actually appear on day T+1. If a crowd gathering event such as a large concert takes place on day T+1, subway dispatchers cannot produce a better schedule from the original model, and serious passenger stranding may result.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a departure scheduling method, device, equipment and storage medium that can dynamically adjust the departure mode in real time.

In a first aspect, the present invention provides a dispatching method, including:

initializing a current value network and a target value network, and dividing the operation time of a train on a line into a preset number of departure time periods;

obtaining passenger trip data of one historical day and the capacity of the train of the line, wherein the passenger trip data comprises the inbound time, the inbound station ID, the outbound station ID and the transit station ID of each passenger, and the transit station ID is determined according to the inbound station and the outbound station through a shortest path algorithm;

according to the passenger travel data and the capacity of the trains of the line, carrying out passenger flow simulation for a preset total number of simulations, acquiring sample data during each simulation, storing the sample data in a memory bank, counting the number of fast departure time periods and the total waiting time corresponding to the current simulation, and determining the current value network corresponding to the current simulation, wherein one simulation is a passenger flow simulation of one day's operation time; each sample data comprises the state data of a departure time period, the action of the next departure time period, the state data of the next departure time period and a return value; the state data comprises the number of persons in each station on the line, the positions of the dispatched trains and the number of persons in the dispatched trains; the action is dispatching at a preset fast departure frequency or dispatching at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

when new sample data is stored in a memory base and the number of the sample data in the memory base reaches a preset number threshold value, randomly selecting a batch of sample data from the memory base according to a preset batch size, training a latest current value network according to the batch of sample data, and taking the trained current value network as the latest current value network;

when the number of simulations reaches the preset total number of simulations, determining the optimal current value network corresponding to each number of fast departure time periods according to the number of fast departure time periods, the total waiting time and the current value network corresponding to each simulation;

selecting an optimal current value network corresponding to the number of the fast departure time periods according to requirements;

and acquiring the state data of the departure time period of the line, and determining the action of the next departure time period of the departure time period through the selected optimal current value network according to the state data of the departure time period.

In a second aspect, the present invention further provides a departure scheduling device, including:

the initialization module is used for initializing a current value network and a target value network and dividing the operation time of the train of one line into a preset number of departure time periods;

the acquisition module is used for acquiring passenger trip data of one historical day and the capacity of the train of the line, wherein the passenger trip data comprises the inbound time, the inbound station ID, the outbound station ID and the transit station ID of each passenger, and the transit station ID is determined according to the inbound station and the outbound station through a shortest path algorithm;

the simulation module is used for carrying out passenger flow simulation for a preset total number of simulations according to the passenger travel data and the capacity of the trains of the line, acquiring sample data during each simulation, storing the sample data in a memory bank, counting the number of fast departure time periods and the total waiting time corresponding to the current simulation, and determining the current value network corresponding to the current simulation, wherein one simulation is a passenger flow simulation of one day's operation time; each sample data comprises the state data of a departure time period, the action of the next departure time period, the state data of the next departure time period and a return value; the state data comprises the number of persons in each station on the line, the positions of the dispatched trains and the number of persons in the dispatched trains; the action is dispatching at a preset fast departure frequency or dispatching at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

the training module is used for randomly selecting a batch of sample data from the memory base according to the preset batch size when new sample data is stored in the memory base and the number of the sample data in the memory base reaches a preset number threshold, training the latest current value network according to the batch of sample data, and taking the trained current value network as the latest current value network;

the first determining module is used for determining, when the number of simulations reaches the preset total number of simulations, the optimal current value network corresponding to each number of fast departure time periods according to the number of fast departure time periods, the total waiting time and the current value network corresponding to each simulation;

the selection module is used for selecting an optimal current value network corresponding to the number of the time periods of quick departure according to the requirement;

and the second determining module is used for acquiring the state data of the departure time period of the line and determining the action of the next departure time period of the departure time period through the selected optimal current value network according to the state data of the departure time period.

In a third aspect, the present invention also provides an electronic device, including:

one or more processors;

storage means for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the departure scheduling method as provided in the first aspect.

In a fourth aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the departure scheduling method as provided in the first aspect.

The beneficial effects of the invention are as follows: during training, the DQN agent selects the departure action for the next time period from the environmental state at the current moment. Therefore, when the trained model is deployed in actual operation, the departure decision for the next departure time period can be given in real time from only the current states of the trains and stations, so that the departure mode can be dynamically adjusted in real time according to the current passenger flow, the pressure on rail transit is effectively reduced, and the method has strong adaptive capability for emergencies such as abnormal passenger flow. Meanwhile, as the agent learns during training, optimal models corresponding to different numbers of fast departure time periods are generated, and staff can select a suitable model to deploy according to the actual situation.

Drawings

Fig. 1 is a flowchart of a departure scheduling method according to the present invention;

Fig. 2 is a schematic structural diagram of a departure scheduling device according to the present invention;

Fig. 3 is a schematic structural diagram of an electronic device according to the present invention;

Fig. 4 is a flowchart of a departure scheduling method according to a first embodiment of the present invention;

Fig. 5 is a flowchart of passenger flow simulation according to the first embodiment of the present invention;

Fig. 6 is a flowchart of passenger flow simulation within one departure time period according to the first embodiment of the present invention;

Fig. 7 is a flowchart of step S404 according to the first embodiment of the present invention;

Fig. 8 is a schematic diagram of the training results of the reinforcement learning model when the number of fast departure time periods is 1 to 6 according to the first embodiment of the present invention;

Fig. 9 is a schematic diagram of the training results of the reinforcement learning model when the number of fast departure time periods is 7 to 14 according to the first embodiment of the present invention;

Fig. 10 is a schematic diagram of the training results of the reinforcement learning model when the number of fast departure time periods is 15 to 24 according to the first embodiment of the present invention;

Fig. 11 is a schematic diagram of the optimal solutions for different numbers of fast departure time periods according to the first embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently, or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. Processing may correspond to methods, functions, procedures, subroutines, sub-computer programs, and the like.

Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, first information may be referred to as second information, and similarly, second information may be referred to as first information, without departing from the scope of the present application. The first information and the second information are both information, but they are not the same information. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.

As shown in fig. 1, a departure scheduling method includes:

S101: initializing a current value network and a target value network, and dividing the operation time of a train on a line into a preset number of departure time periods;

S102: obtaining passenger trip data of one historical day and the capacity of the train of the line, wherein the passenger trip data comprises the inbound time, the inbound station ID, the outbound station ID and the transit station ID of each passenger, and the transit station ID is determined according to the inbound station and the outbound station through a shortest path algorithm;

S103: according to the passenger travel data and the capacity of the trains of the line, carrying out passenger flow simulation for a preset total number of simulations, acquiring sample data during each simulation, storing the sample data in a memory bank, counting the number of fast departure time periods and the total waiting time corresponding to the current simulation, and determining the current value network corresponding to the current simulation, wherein one simulation is a passenger flow simulation of one day's operation time; each sample data comprises the state data of a departure time period, the action of the next departure time period, the state data of the next departure time period and a return value; the state data comprises the number of persons in each station on the line, the positions of the dispatched trains and the number of persons in the dispatched trains; the action is dispatching at a preset fast departure frequency or dispatching at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

S104: when new sample data is stored in the memory bank and the number of sample data in the memory bank reaches a preset number threshold, randomly selecting a batch of sample data from the memory bank according to a preset batch size, training the latest current value network according to the batch of sample data, and taking the trained current value network as the latest current value network;

S105: when the number of simulations reaches the preset total number of simulations, determining the optimal current value network corresponding to each number of fast departure time periods according to the number of fast departure time periods, the total waiting time and the current value network corresponding to each simulation;

S106: selecting, according to requirements, the optimal current value network corresponding to one number of fast departure time periods;

S107: acquiring the state data of a departure time period of the line, and determining the action of the next departure time period through the selected optimal current value network according to the state data of that departure time period.

A traditional algorithm such as a genetic algorithm can only compute the optimal solution for the previous day from that day's subway passenger flow, and the result is then applied to the current day on the assumption that the current day's passenger flow is approximately the same as the previous day's. The traditional algorithm therefore ignores many of the current day's passenger flow characteristics and cannot make timely dynamic adjustments. During training, the DQN agent selects the departure action for the next time period from the environmental state at the current moment. Therefore, when the trained model is deployed in actual operation, the departure decision for the next departure time period can be given in real time from only the current states of the trains and subway stations, so that the departure mode can be dynamically adjusted in real time according to the current subway passenger flow, the pressure on rail transit is effectively reduced, and the method has strong adaptive capability for emergencies such as abnormal passenger flow.
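The selection in step S105 can be illustrated with a minimal Python sketch. It assumes that "optimal" means the smallest total waiting time among simulations with the same number of fast departure time periods; the `SimulationResult` type and the `network_id` handle are hypothetical names introduced only for this illustration.

```python
from typing import Dict, List, NamedTuple


class SimulationResult(NamedTuple):
    fast_periods: int   # number of fast departure time periods in this simulation
    total_wait: float   # total waiting time of this simulation
    network_id: str     # hypothetical handle for the saved current value network


def best_per_fast_count(results: List[SimulationResult]) -> Dict[int, SimulationResult]:
    """For each number of fast periods, keep the simulation with the least waiting time."""
    best: Dict[int, SimulationResult] = {}
    for result in results:
        current = best.get(result.fast_periods)
        if current is None or result.total_wait < current.total_wait:
            best[result.fast_periods] = result
    return best
```

Staff would then pick one key of the returned dictionary (step S106) and deploy the network recorded under it.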

In an optional embodiment, the step S103 includes:

presetting the action of a first departure time period of operation time in the ith simulation, and taking the first departure time period as the current departure time period, wherein the initial value of i is 1;

according to the action of the current departure time period, passenger travel data, capacity and preset unit time, carrying out passenger flow simulation of the current departure time period, acquiring state data when the current departure time period is ended as the state data of the current departure time period, and meanwhile, counting the total waiting time of the current departure time period according to the number of waiting passengers at each station on the line in each unit time in the current departure time period;

generating a random number, wherein the range of the random number is 0-1;

if the random number is smaller than the exploration rate corresponding to the ith simulation, randomly generating the action of the next departure time period of the current departure time period;

if the random number is greater than or equal to the exploration rate corresponding to the ith simulation, determining the action of the next departure time period of the current departure time period according to the latest current value network;

according to the action of the next departure time period, passenger travel data, capacity and preset unit time, carrying out passenger flow simulation of the next departure time period, acquiring state data when the next departure time period is ended as the state data of the next departure time period, and meanwhile, according to the number of waiting passengers at each station on the line in each unit time in the next departure time period, counting the total waiting time of the next departure time period;

calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and a penalty term function corresponding to the j-th iteration, wherein $j = \lceil i / \mathrm{epoch} \rceil$ and $\mathrm{epoch}$ is the preset number of simulations per iteration;

generating sample data according to the state data of the current departure time period, the action and the return value of the next departure time period and the state data of the next departure time period, and storing the sample data in a memory bank;

judging whether the next departure time period is the last departure time period of the operation time or not;

if not, taking the next departure time period as the current departure time period, and continuing to execute the step of generating the random number;

if so, counting the number of departure time periods for departure at a fast departure frequency in the ith simulation to obtain the number of the fast departure time periods corresponding to the ith simulation, calculating the total waiting time corresponding to the ith simulation according to the total waiting time of each departure time period of the operating time in the ith simulation, and taking the current latest current value network as the current value network corresponding to the ith simulation;

judging whether i is equal to a preset total simulation number;

and if not, determining the exploration rate corresponding to the (i+1)-th simulation according to the exploration rate corresponding to the i-th simulation and a preset minimum exploration rate, wherein the exploration rate corresponding to the first simulation is a preset initial exploration rate; then letting i = i + 1 and returning to the step of presetting the action of the first departure time period of the operation time in the i-th simulation and taking the first departure time period as the current departure time period.

That is, one simulation is a passenger flow simulation of one day's operation time. Running the simulation generates sample data and stores it in the memory bank for subsequent training of the current value network, and during the simulation new sample data is generated in real time based on the latest current value network.
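The memory bank behaviour described above (store each sample, then draw a random batch once a threshold is reached) can be sketched as follows. This is an illustration only: the `MemoryBank` class, the bounded `capacity`, and the encoding of the action as 0 (slow) or 1 (fast) are assumptions that do not appear in the source.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, return value, next state); action 1 = fast, 0 = slow (assumed encoding)
Sample = Tuple[List[float], int, float, List[float]]


class MemoryBank:
    def __init__(self, capacity: int, threshold: int, batch_size: int) -> None:
        if threshold < batch_size:
            raise ValueError("threshold must be at least batch_size")
        # A bounded deque drops the oldest samples; the bound itself is an assumption.
        self.samples: Deque[Sample] = deque(maxlen=capacity)
        self.threshold = threshold
        self.batch_size = batch_size

    def store(self, sample: Sample) -> None:
        self.samples.append(sample)

    def ready(self) -> bool:
        """True once the number of stored samples reaches the preset threshold."""
        return len(self.samples) >= self.threshold

    def sample_batch(self) -> List[Sample]:
        """Randomly select one batch of the preset batch size."""
        return random.sample(list(self.samples), self.batch_size)
```

In a training loop, `store` would be called after every departure time period, and a training step on `sample_batch()` would follow whenever `ready()` returns true.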

In an optional embodiment, carrying out the passenger flow simulation of the current departure time period according to the action of the current departure time period, the passenger travel data, the capacity and the preset unit time includes:

taking the first unit time of the current departure time period as the current unit time;

respectively judging whether a train arrives at each station on the line in the current unit time according to the action of the current departure time period and preset train operation data;

if a train arrives at a station, performing passenger flow interaction processing for the station according to the passenger trip data and the capacity of the train, wherein the passenger flow interaction processing comprises passengers alighting and passengers boarding; the alighting passengers comprise passengers on the train whose outbound station ID or transit station ID is the station ID of the station; the boarding passengers comprise the passengers in the station, and the passengers in the station comprise inbound passengers and transfer passengers; the inbound passengers comprise passengers whose inbound station ID is the station ID of the station and who have not yet boarded a train; and the transfer passengers comprise passengers whose transit station ID is the station ID of the station, whose time since arriving at the station exceeds a preset transfer time, and who have not yet boarded a train;

updating the number of people in the station and the number of people in the dispatched train according to the result of the passenger flow interaction processing, and counting the number of people waiting at the station in the current unit time;

if no train arrives at a station, updating the number of people in the station according to the passenger trip data and the station ID of the station, and counting the number of people waiting at the station in the current unit time;

counting the total number of waiting passengers in the current unit time according to the number of waiting passengers at each station on the line in the current unit time;

judging whether the current unit time is the last unit time of the current departure time period or not;

and if not, taking the next unit time as the current unit time, continuing to execute the step of respectively judging whether a train arrives at each station on the line in the current unit time according to the action of the current departure time period and preset train operation data.

The simulation system simulates passenger flow with the unit time as its smallest time step and calculates the total waiting time of all passengers. The simulation system also supports transfers, with transfer routes given by a shortest path algorithm.
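The bookkeeping of one departure time period can be sketched in Python as below. Two simplifications are assumed and are not stated in the source: waiting time is accumulated as the number of waiting passengers multiplied by the unit time, and boarding is limited only by the remaining train capacity. The function names are hypothetical.

```python
from typing import List, Tuple


def board(waiting: int, on_train: int, capacity: int) -> Tuple[int, int]:
    """Board as many waiting passengers as the remaining capacity allows.

    Returns (passengers still waiting, passengers on the train).
    """
    boarded = min(waiting, max(capacity - on_train, 0))
    return waiting - boarded, on_train + boarded


def total_waiting_time(waiting_counts: List[List[int]], unit_time: float = 1.0) -> float:
    """Sum the waiting passengers over all unit times and stations.

    waiting_counts[t][s] is the number of passengers waiting at station s
    during unit time t of the departure time period.
    """
    return sum(sum(per_station) for per_station in waiting_counts) * unit_time
```

For example, with 10 passengers waiting, 95 on board and a capacity of 100, `board` leaves 5 passengers on the platform, and those 5 contribute to the waiting count of every unit time until the next train arrives.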

In an alternative embodiment, determining the action of the next departure time period of the current departure time period according to the latest current value network includes:

according to the state data of the current departure time period and a preset action set, calculating the score of each action in the action set through the latest current value network, and taking the action with the maximum score as the action of the next departure time period, wherein the action set comprises dispatching at the preset fast departure frequency and dispatching at the preset slow departure frequency.

That is, when the generated random number is not less than the exploration rate corresponding to the current simulation, the optimal action is determined based on the latest current value network and is taken as the action of the next departure time period.
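The random-versus-greedy choice described above is the usual epsilon-greedy rule, which can be sketched as follows. The `q_values` callable stands in for the current value network and is an assumption of this sketch, as is the encoding of the two actions as 0 and 1.

```python
import random
from typing import Callable, Sequence

SLOW, FAST = 0, 1          # assumed encoding of the two actions
ACTIONS = (SLOW, FAST)


def choose_action(
    state: Sequence[float],
    q_values: Callable[[Sequence[float]], Sequence[float]],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Pick the action of the next departure time period.

    With probability epsilon a random action is explored; otherwise the
    action with the highest score from the value network is taken.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    scores = q_values(state)
    return max(ACTIONS, key=lambda action: scores[action])
```

Passing an explicit `random.Random` instance keeps the simulation reproducible when a fixed seed is used.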

In an optional embodiment, the calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and a penalty term function corresponding to the jth iteration includes:

calculating the return value according to a return value calculation formula, the formula being $r = -C_t^k - a\,(f_j(x) - f_j(x-1))$, wherein $r$ is the return value, $C_t^k$ is the total waiting time of the current departure time period, $a = 1$ if the action of the next departure time period is dispatching at the preset fast departure frequency and $a = 0$ if it is dispatching at the preset slow departure frequency, and $f_j(x)$ is the penalty term function corresponding to the j-th iteration.
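A minimal Python sketch of this return value follows. It assumes that x is the number of fast departure time periods counted so far in the simulation, including the next one when it is fast; the source does not spell this out here, so that reading and the function names are assumptions.

```python
from typing import Callable


def return_value(
    wait_time: float,
    next_action_fast: bool,
    penalty: Callable[[int], float],
    x: int,
) -> float:
    """r = -C - a * (f_j(x) - f_j(x - 1)).

    wait_time is the total waiting time C of the current departure time period,
    penalty is the penalty term function f_j of the current iteration, and
    a is 1 for a fast departure and 0 for a slow one.
    """
    a = 1 if next_action_fast else 0
    return -wait_time - a * (penalty(x) - penalty(x - 1))
```

With the initial linear penalty f(x) = x * M0, the difference f(x) - f(x - 1) is simply M0, so a fast departure costs the same fixed amount every time until the penalty function is updated.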

Further, before letting i = i +1, the method further includes:

judging whether i is equal to the integral multiple of the simulation times of each iteration;

if yes, determining the penalty term function corresponding to the (j+1)-th iteration according to a penalty term function update formula, the update formula being $f_{j+1}(x) = K_{new} \times \mathrm{Smooth}(C_{best,j}(x)) + K_{old} \times f_j(x)$, wherein $f_{j+1}(x)$ is the penalty term function corresponding to the (j+1)-th iteration, $f_j(x)$ is the penalty term function corresponding to the j-th iteration, $\mathrm{Smooth}()$ is a smoothing function, $C_{best,j}(x)$ is the minimum total waiting time corresponding to $x$ fast departure time periods in the j-th iteration, $K_{new}$ and $K_{old}$ are preset adjustment parameters, $f_1(x) = x \cdot M_0$, $x$ is the number of fast departure time periods, and $M_0$ is the preset penalty term for a single fast departure.

That is, each time a round of iterative simulation is completed, the penalty term function f(x) is updated, which improves the model and makes it general enough to handle the subway environments of different cities.
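The update can be sketched in Python by tabulating the penalty term function for x = 0, 1, ..., N. The source does not specify the smoothing function, so a centred moving average is used here purely as a stand-in; it and all function names are assumptions, and the sketch also assumes a best waiting time is available for every x.

```python
from typing import List


def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Centred moving average with a shrinking window at the edges (assumed Smooth)."""
    half = window // 2
    smoothed = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def initial_penalty(num_periods: int, m0: float) -> List[float]:
    """f_1(x) = x * M0, tabulated for x = 0..num_periods."""
    return [x * m0 for x in range(num_periods + 1)]


def update_penalty(
    f_j: List[float], c_best_j: List[float], k_new: float, k_old: float
) -> List[float]:
    """f_{j+1}(x) = K_new * Smooth(C_best_j(x)) + K_old * f_j(x)."""
    smoothed = moving_average(c_best_j)
    return [k_new * s + k_old * f for s, f in zip(smoothed, f_j)]
```

Keeping the function as a table indexed by x makes the difference f(x) - f(x - 1) used in the return value a simple lookup.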

In an alternative embodiment, after step S102, the method further includes:

setting the action of each departure time period in the operation time to departure at the preset slow departure frequency, performing one slow departure simulation according to the passenger travel data and the capacity of the trains of the one line, and obtaining the theoretical longest waiting time by totaling the total waiting time of each departure time period in the slow departure simulation;

setting the action of each departure time period in the operation time to departure at the preset fast departure frequency, performing one fast departure simulation according to the passenger travel data and the capacity of the trains of the one line, and obtaining the theoretical shortest waiting time by totaling the total waiting time of each departure time period in the fast departure simulation;

and dividing the difference between the theoretical longest waiting time and the theoretical shortest waiting time by the total number of departure time periods in one day to obtain the preset penalty term for a single fast departure.

That is, the penalty term for a single fast departure can be regarded as a reduction value of the total waiting time caused by adding one fast departure time period.
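
The calculation can be sketched as follows; the function name and the two waiting-time totals are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def single_fast_departure_penalty(longest_wait: float,
                                  shortest_wait: float,
                                  num_periods: int) -> float:
    """M_0 = (theoretical longest - theoretical shortest) / number of periods."""
    if num_periods <= 0:
        raise ValueError("num_periods must be positive")
    return (longest_wait - shortest_wait) / num_periods


# Illustrative totals from an all-slow and an all-fast simulated day.
m0 = single_fast_departure_penalty(90000.0, 54000.0, 36)
print(m0)
```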

In an optional embodiment, the determining the exploration rate corresponding to the (i+1)-th simulation according to the exploration rate corresponding to the i-th simulation and a preset minimum exploration rate includes:

determining the exploration rate corresponding to the (i+1)-th simulation according to an exploration rate update formula, wherein the exploration rate update formula is ε_{i+1} = max(ε_{min}, ε_i - 0.0045), wherein ε_{i+1} is the exploration rate corresponding to the (i+1)-th simulation, ε_i is the exploration rate corresponding to the i-th simulation, and ε_{min} is a preset minimum exploration rate.

Wherein the preset initial value of the exploration rate is ε_1 = 1, and the preset minimum exploration rate is ε_{min} = 0.1.

That is, the exploration rate is updated each time a simulation is performed.
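
A sketch of the resulting exploration rate schedule is given below. It assumes that the rate decreases by 0.0045 per simulation, which is the only reading under which taking the maximum with ε_min = 0.1 and starting from 1 has any effect; the function names are hypothetical.

```python
from typing import List


def next_exploration_rate(eps: float, eps_min: float = 0.1,
                          step: float = 0.0045) -> float:
    """One update: decay by `step`, but never below `eps_min` (assumed decay)."""
    return max(eps_min, eps - step)


def exploration_schedule(num_simulations: int, eps_start: float = 1.0,
                         eps_min: float = 0.1,
                         step: float = 0.0045) -> List[float]:
    """Exploration rate used by each simulation, starting from eps_start."""
    rates = [eps_start]
    for _ in range(num_simulations - 1):
        rates.append(next_exploration_rate(rates[-1], eps_min, step))
    return rates


rates = exploration_schedule(300)
print(rates[0], rates[-1])
```

Under these values the rate falls from 1 to the floor of 0.1 after roughly 200 simulations and stays there for the rest of the run.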

In an optional embodiment, before storing the sample data in the memory bank, the method further includes:

if the memory bank is full, deleting the sample data stored in the memory bank earliest.

This ensures that the sample data stored in the memory bank is always the most recent sample data.
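
One simple way to obtain this first-in, first-out behaviour is a bounded double-ended queue. The class `MemoryBank` and its method names are hypothetical; each sample is assumed to be a tuple (s, a, r, s').

```python
import random
from collections import deque
from typing import Any, List, Tuple

Sample = Tuple[Any, int, float, Any]  # (s, a, r, s')


class MemoryBank:
    """Fixed-capacity store that discards the oldest sample when full."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops items from the opposite end on append.
        self._samples = deque(maxlen=capacity)

    def store(self, sample: Sample) -> None:
        self._samples.append(sample)

    def sample_batch(self, batch_size: int) -> List[Sample]:
        """Draw batch_size distinct samples uniformly at random."""
        return random.sample(list(self._samples), batch_size)

    def snapshot(self) -> List[Sample]:
        """Stored samples, oldest first."""
        return list(self._samples)

    def __len__(self) -> int:
        return len(self._samples)


bank = MemoryBank(capacity=3)
for k in range(5):
    bank.store(("s%d" % k, k % 2, -float(k), "s%d" % (k + 1)))
print(bank.snapshot())  # only the three most recent samples remain
```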

In an optional embodiment, the step S104 includes:

when new sample data is stored in the memory base and the number of the sample data in the memory base reaches a preset number threshold, randomly selecting sample data with a preset batch size from the memory base as the sample data of the current batch, and using the latest current value network as the current value network to be trained;

traversing the sample data of the current batch, and sequentially acquiring sample data from the sample data of the current batch;

calculating the score corresponding to the state data of the current departure time period and the action of the next departure time period in the sample data through the latest current value network to be trained, and taking the score as a first score corresponding to the sample data;

calculating, through the latest target value network, the score of the state data of the next departure time period in the sample data for each action, and taking the maximum score as a second score corresponding to the sample data;

calculating a loss value according to a return value in the sample data, a first score and a second score corresponding to the sample data and a preset discount rate, and updating the latest network parameters of the current value network to be trained according to the loss value;

and after traversing the sample data of the current batch, taking the latest current value network to be trained as the latest current value network.

That is, training on one complete batch of sample data counts as one complete update of the current value network.

In an optional embodiment, the method further includes:

and when the number of simulations reaches an integral multiple of a preset first number, updating the network parameters of the target value network according to the network parameters of the latest current value network.

That is, each time a certain number of passenger flow simulations have been performed, the network parameters of the target value network are replaced with the network parameters of the latest current value network. The purpose of the replacement is to let the neural network learn further. Until the next replacement, the network parameters θ' of the target value network remain fixed, and only the network parameters θ of the current value network are changed by the training of step S104.
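
The periodic replacement can be sketched as below. Network parameters are represented as a plain dictionary for illustration, and the names `sync_target_if_due` and `sync_every` (standing for the preset first number) are hypothetical.

```python
import copy
from typing import Any, Dict

Params = Dict[str, Any]


def sync_target_if_due(sim_count: int, sync_every: int,
                       current_params: Params,
                       target_params: Params) -> Params:
    """Return the target parameters to use after simulation `sim_count`.

    When sim_count is a multiple of sync_every, the target network receives
    an independent copy of the current parameters; otherwise it is unchanged.
    """
    if sync_every <= 0:
        raise ValueError("sync_every must be positive")
    if sim_count % sync_every == 0:
        return copy.deepcopy(current_params)
    return target_params


theta = {"w": [1.0, 2.0]}
theta_target = {"w": [0.0, 0.0]}
theta_target = sync_target_if_due(10, 5, theta, theta_target)
print(theta_target)
```

The deep copy matters: if the target network merely shared the same object, later training of the current value network would silently change the target as well.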

In an optional embodiment, the calculating a loss value according to the return value in the sample data, the first score and the second score corresponding to the sample data, and the preset discount rate includes:

calculating the loss value according to a loss function, the loss function being Loss = (Q_{target}(s,a) - Q_{evel}(s,a))^2 with Q_{target}(s,a) = r + γ × max_{a'∈A} Q(s',a'), wherein Loss is the loss value, Q_{evel}(s,a) is the first score corresponding to the sample data, r is the return value in the sample data, γ is the preset discount rate, and max_{a'∈A} Q(s',a') is the second score corresponding to the sample data.
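
For a single sample the loss can be computed as in the following sketch. The function names and numbers are hypothetical; `next_scores` is assumed to hold the target value network's scores for every action in the next state.

```python
from typing import Sequence


def td_target(r: float, gamma: float, next_scores: Sequence[float]) -> float:
    """Q_target(s, a) = r + gamma * max over a' of Q(s', a')."""
    return r + gamma * max(next_scores)


def squared_td_loss(first_score: float, r: float, gamma: float,
                    next_scores: Sequence[float]) -> float:
    """Loss = (Q_target(s, a) - Q_evel(s, a)) ** 2 for one sample."""
    return (td_target(r, gamma, next_scores) - first_score) ** 2


# first score -12, return -10, discount 0.9, next-state scores for a' = 0, 1
print(squared_td_loss(-12.0, -10.0, 0.9, [-20.0, -5.0]))
```

Here the second score is max(-20, -5) = -5, the target is -10 + 0.9 × (-5) = -14.5, and the loss is (-14.5 + 12)^2 = 6.25.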

In an optional embodiment, step S105 includes:

and when the number of simulations reaches the preset total number of simulations, comparing the total waiting times of all simulations that have the same number of fast departure time periods, and taking the current value network corresponding to the simulation with the smallest total waiting time as the optimal current value network for that number of fast departure time periods.

Because the number of fast departure time periods is bounded, the optimal model for every possible number of fast departure time periods is recorded. In practice, an operator can select the optimal model corresponding to the required number of fast departure time periods according to actual needs.
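
The bookkeeping described above amounts to keeping, for each number of fast departure time periods, the record with the smallest total waiting time. In this sketch each simulation is summarized by a hypothetical tuple (number of fast periods, total waiting time, network identifier); the identifiers stand in for saved copies of the current value network.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, float, str]  # (num_fast_periods, total_wait, network_id)


def best_network_per_count(records: Iterable[Record]) -> Dict[int, Tuple[float, str]]:
    """Map each fast-period count to (smallest total wait, its network)."""
    best: Dict[int, Tuple[float, str]] = {}
    for n_fast, total_wait, network_id in records:
        if n_fast not in best or total_wait < best[n_fast][0]:
            best[n_fast] = (total_wait, network_id)
    return best


history = [
    (8, 5200.0, "net_sim_17"),
    (8, 4900.0, "net_sim_42"),
    (9, 4700.0, "net_sim_43"),
    (8, 5100.0, "net_sim_60"),
]
print(best_network_per_count(history))
```

An operator who can afford, say, 8 fast departure time periods would then look up key 8 and deploy the network stored there.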

As shown in fig. 2, the present invention also provides a dispatching device for dispatching a train, including:

an initialization module 201, configured to initialize a current value network and a target value network, and divide the operation time of a train on a route into a preset number of departure time periods;

the obtaining module 202 is configured to obtain passenger trip data of a historical day and a capacity of a train of the one route, where the passenger trip data includes inbound time, inbound station ID, outbound station ID, and transit station ID of each passenger, and the transit station ID is determined by a shortest path algorithm according to the inbound station and the outbound station;

the simulation module 203 is configured to perform passenger flow simulation for a preset total number of simulations according to the passenger travel data and the capacity of the trains of the one line, acquire sample data during each simulation and store the sample data in a memory bank, and, after each simulation is finished, count the number of fast departure time periods and the total waiting time corresponding to the current simulation and determine the current value network corresponding to the current simulation, wherein one simulation is a passenger flow simulation of the operation time of one day; each sample data comprises the state data of a departure time period, the action of the next departure time period of that departure time period, the state data of the next departure time period, and a return value; the state data comprises the number of people in each station on the one line, the positions of the trains that have departed, and the number of people on the trains that have departed; the action is departure at a preset fast departure frequency or departure at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

a training module 204, configured to randomly select a batch of sample data from a memory according to a preset batch size when new sample data is stored in the memory and the number of the sample data in the memory reaches a preset number threshold, train a latest current value network according to the batch of sample data, and use the trained current value network as the latest current value network;

the first determining module 205 is configured to determine, when the simulation times reach a preset total simulation times, an optimal current value network corresponding to each number of fast departure time periods according to the number of fast departure time periods, the total waiting time, and the current value network corresponding to each simulation;

a selecting module 206, configured to select, according to a requirement, an optimal current value network corresponding to a number of time periods of fast departure;

the second determining module 207 is configured to obtain the state data of the departure time period of the one line, and determine the action of the next departure time period of the departure time period according to the state data of the departure time period and through the selected optimal current value network.

As shown in fig. 3, the present invention also provides an electronic device, including:

one or more processors 301;

a storage device 302 for storing one or more programs;

when executed by the one or more processors 301, the one or more programs cause the one or more processors 301 to implement the departure scheduling method as described above.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the departure scheduling method as described above.

Example one

Referring to fig. 4-11, a first embodiment of the present invention is: a dispatching method for dispatching trains can be applied to dispatching trains of subways.

In view of the problems of existing solutions, such as long computation time, high complexity, neglect of many detailed features, and the inability to adjust the generated static timetable or to cope with emergencies, this embodiment uses a deep reinforcement learning (Deep Reinforcement Learning) algorithm to overcome them.

First, a subway environment simulation system accurate to the minute is built from the subway data of a certain city. Because the simulation is built with reference to the real operating conditions of the subway network and measures distance by subway running time, it ports well between cities. Specifically, the data of the simulation environment is the passenger travel data of a certain historical period in a certain city, mainly comprising each passenger's inbound time, inbound station ID and outbound station ID. The simulation environment simulates the flow of people through the whole subway system at a minimum granularity of one minute and calculates the total waiting time of all passengers.

In addition to basic single-line simulation, the simulation environment also supports transfers. Since the raw data records only the inbound and outbound stations of a transferring passenger, the optimal (least time-consuming) transfer station is assigned using the Dijkstra algorithm, i.e., the passenger's transfer route is obtained by computing the shortest path between the two stations with the Dijkstra algorithm.

Assuming that the departure frequency of the subway can be changed every 30 minutes, the simulation environment generates the departure timetable for the next 30 minutes according to the given departure frequency and simulates that period frame by frame, each frame lasting 1 min. Taking a departure time period T_0 = 30 min as an example, within the current departure time period the simulation environment simulates subway departures at a departure interval of t_k minutes and refreshes itself at intervals of t_0 = 1 min. Each refresh decides whether to execute a departure action, updates the train positions, and updates the number of people leaving and entering each station; when a train arrives at a station, the train and the people in the station interact according to the passengers' trips, subject to practical constraints such as the train not being overloaded. Because the card-swiping position of most subway stations is close to the boarding position, the time passengers take to enter and leave a station is not considered in the simulation system. However, because of the layout of subway lines, a transfer often requires a certain walking time; to account for this transfer time, a transferring passenger is designed to remain in a transfer state at the transfer station for a time t_1, i.e., the passenger cannot transfer to the next subway immediately.
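
The frame-by-frame refresh can be illustrated with a deliberately reduced model: one station, one direction, and trains that arrive empty. These simplifications, the function name, and the convention that passengers still on the platform at the end of a frame each accumulate one minute of waiting are assumptions made for this sketch, not details taken from the text.

```python
from typing import List, Tuple


def simulate_station(arrivals: List[int], headway: int,
                     capacity: int) -> Tuple[int, int]:
    """Simulate one platform in 1-minute frames.

    arrivals[m] : passengers entering the station during minute m
    headway     : departure interval t_k in minutes (a train at minute 0,
                  headway, 2 * headway, ...)
    capacity    : maximum passengers a train may take (no overloading)

    Returns (total waiting time in passenger-minutes, passengers left waiting).
    """
    if headway <= 0:
        raise ValueError("headway must be positive")
    waiting = 0
    total_wait = 0
    for minute, arriving in enumerate(arrivals):
        waiting += arriving
        if minute % headway == 0:
            boarded = min(waiting, capacity)  # capacity-limited boarding
            waiting -= boarded
        total_wait += waiting  # everyone still on the platform waits 1 min
    return total_wait, waiting


print(simulate_station([5] * 6, headway=3, capacity=8))
```

In the example the second train cannot take all 15 waiting passengers, so 7 are left behind and keep accumulating waiting time, which is the effect the capacity constraint is meant to capture.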

On the basis of this simulation environment, the Deep Q-Network (DQN) algorithm from deep reinforcement learning is selected to formulate the departure strategy. To let DQN solve the subway scheduling problem better, the problem is abstracted and suitably simplified.

The operation time of the subway is divided evenly into several departure time periods. For example, assuming that the subway operates from 6:00 to 24:00, 18 hours in total, the operation time is divided evenly into 36 departure time periods of half an hour each. The DQN algorithm determines the departure frequency of the next departure time period according to factors such as the current running state of the subway and the number of people in the stations. The official timetable divides departure intervals into two categories, a peak period and an off-peak period: the peak period is 6:30 to 8:30 and 16:30 to 18:30, during which the departure frequency of the subway is increased, and the rest of the day is regarded as the off-peak period. To match actual conditions, a 36-bit vector of 0s and 1s is used to indicate the departure frequencies of one day, where 0 represents the slow departure frequency of the off-peak period and 1 represents the fast departure frequency of the peak period. The slow and fast departures represented by 0 and 1 are mapped in the simulation to specific departure intervals for the current line, for example one train every 8 min for slow departure and one train every 3 min for fast departure.
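
The mapping from the 36-bit vector to concrete departure times can be sketched as follows. The function name is hypothetical, and it is assumed that the first train of every departure time period leaves at the start of that period, which the text does not specify; the 8 min and 3 min intervals are the example values given above.

```python
from typing import List


def departure_minutes(plan: List[int], period_len: int = 30,
                      slow_headway: int = 8,
                      fast_headway: int = 3) -> List[int]:
    """Departure times in minutes after service start for a 0/1 plan."""
    times: List[int] = []
    for k, bit in enumerate(plan):
        if bit not in (0, 1):
            raise ValueError("plan entries must be 0 or 1")
        headway = fast_headway if bit == 1 else slow_headway
        start = k * period_len
        # Assumed: one train at the start of the period, then every headway.
        times.extend(range(start, start + period_len, headway))
    return times


# Official-style plan for service from 6:00: fast in 6:30-8:30 and 16:30-18:30.
official = [1 if 1 <= k <= 4 or 21 <= k <= 24 else 0 for k in range(36)]
print(sum(official), len(departure_minutes(official)))
```

With service starting at 6:00, the two peak windows correspond to period indices 1 to 4 and 21 to 24, which gives the 8 fast departure time periods of the official timetable.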

In terms of effectiveness, the evaluation is done by calculating the total waiting time T of all passengers on the line. The subway scheduling problem is therefore abstracted into a dual-target optimization problem: choose n positions (0 ≤ n ≤ 36) of a 36-dimensional binary vector to set to 1, and make T as small as possible while keeping n as small as possible.

The complete solution space of the above problem has size 2^36 = 68,719,476,736.

Even if only a single-target optimization problem is considered, that is, the number of fast departure time periods is limited to 8 (the same number as in the existing official departure timetable, but without restricting the positions of the eight 1s in the binary vector), there are still C(36, 8) = 30,260,340 candidate solutions.

Each simulation of one day takes about 1 min in total, so a brute-force search would need about 57.6 years even when only 8 fast departure time periods are considered. A more efficient solution algorithm is therefore necessary.
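
The brute-force estimate can be reproduced with a few lines of arithmetic, assuming one minute per simulated day and a 365-day year; the variable names are illustrative.

```python
import math

PERIODS = 36          # departure time periods per day
FAST_PERIODS = 8      # fast periods in the official timetable
MINUTES_PER_YEAR = 60 * 24 * 365

full_space = 2 ** PERIODS                       # every 0/1 vector of length 36
eight_fast = math.comb(PERIODS, FAST_PERIODS)   # vectors with exactly eight 1s
years = eight_fast / MINUTES_PER_YEAR           # at 1 minute per simulation

print(full_space, eight_fast, round(years, 1))
```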

So far, the subway scheduling problem has been abstracted into a dual-target optimization problem that can be solved by the DQN algorithm. In this embodiment, a Pareto-optimal solution set for the dual-target optimization problem is obtained by traversing all values of one of the targets.

The training part of the DQN model is mainly divided into two stages, namely an exploration stage and a learning stage.

In the exploration phase, the DQN model continuously acquires the state of the current environment from the simulation environment. Because the time precision of the simulation environment is 1 min, the state information of the trains and stations during the simulation can easily be obtained from it. However, since the DQN outputs an action only once every 30 min, the state of the simulation environment only needs to be sliced once every 30 min (simulation time). The real-time number of people in each subway station, the real-time positions of the trains that have departed, and the number of people on those trains are taken as features, i.e., fed as input data (also called state data s) into the deep neural network part of the DQN model. The deep neural network calculates the Q values Q(s, a) for all actions a ∈ A, i.e., the score the network gives to executing action a in state s (this can be understood as converting, in certain proportions, the total waiting time in the departure time period, the penalty term arising from the current total number of fast departures, and a forward-looking estimate of both terms into a single score). The action a with the maximum Q value is then selected and executed. After the action is executed, the environment changes, a reward r is obtained from the environment, and the record is stored in the memory bank of the DQN. These operations are repeated as the environment changes until the memory bank is full.

In the learning phase, the DQN model continues to acquire new information and update the memory bank as in the exploration phase, and additionally draws random samples from the memory bank to train its neural network part. The neural network part has two networks of the same structure: a Q-target network and a Q-estimation network. The Q-target network is relatively fixed and stores previously learned knowledge, whereas the Q-estimation network is updated continuously during learning, and its parameters are copied to the Q-target network after a certain number of iterations.

The Q value of the DQN model is updated in the following way:

The first formula: Q_{target}(s,a) = r(s,a) + γ × max_{a'∈A} Q(s',a')

The second formula: Loss = (Q_{target}(s,a) - Q_{evel}(s,a))^2

In the first formula, (s, a, s', a') respectively denote the current state, the action in the current state, the next state, and the action in the next state; that is, a is the action generated from the current state s, and s' is the state reached from s after action a. In this embodiment, s can therefore be regarded as the environment state at the end of the current departure time period, a as the action of the next departure time period, and s' as the environment state at the end of the next departure time period. r(s, a) is the reward function; γ is the discount rate, with a value between 0 and 1; A is the preset action set, A = {0, 1}. Q_{target}(s,a) is the Q-value score given by the target network at (s, a) and is the sum of two parts: the first term r(s, a) is the score for the current state and current action, and the second term is the estimate of the future, where max_{a'∈A} Q(s',a') gives the Q value of the best action in state s' and the discount rate γ discounts this future estimate back to the current state in a certain proportion.

The second formula is the loss function of the estimation network. Between two updates of the target network, which is overwritten with the estimation network every fixed number of steps, the estimation network converges toward the target given by the first formula.

During the experiments, the design of the return function in the DQN algorithm was found to play a crucial role in the result of the subway scheduling optimization problem. For this reason, taking into account the distribution of the solution space of the subway optimization problem, an effective return function was designed for the value r, as follows:

The third formula: r = -C_t^k - a∙M

Wherein C_t^k is the total waiting time of the current departure time period and a ∈ {0,1}, where a = 0 indicates that the action of the current departure time period is slow departure, i.e., departure at the preset slow departure frequency, and a = 1 indicates that the action of the current departure time period is fast departure, i.e., departure at the preset fast departure frequency. The return function is designed with the expectation that each additional fast departure reduces the total waiting time by at least M minutes.

Further analysis shows that adjusting the value of M in the initial return function so that it follows more closely the shape of the theoretical upper bound of the solution space yields better solutions, because this helps the DQN algorithm explore in parallel as it approaches the upper bound of the optimal solutions instead of being trapped in a local optimum. To improve the model and make it general enough to cope with the subway environments of different cities, a method for dynamically adjusting M is also designed. M is then a function of the number x of fast departure time periods, M(x) = f(x) - f(x-1), where f(x) can be understood as the lower bound of the waiting time expected to be saved when the number of fast departure time periods in a day is x, with 0 ≤ x ≤ 36 in this embodiment.

The fourth formula: f_1(x) = x∙M_0

The fifth formula: M_0 = (theoretical longest waiting time - theoretical shortest waiting time) / total number of departure time periods

The sixth formula: f_{i+1}(x) = K_{new} × Smooth(C_{best,i}(x)) + K_{old} × f_i(x)

In the sixth formula, f_{i+1}(x) is the function for the (i+1)-th iteration; Smooth() is a smoothing function; C_{best,i}(x) is obtained by recording, for each number x of fast departure time periods, the minimum total waiting time under the optimal departure model in the i-th iteration; and K_{new} and K_{old} are two adjustment parameters that control the update step size of f(x) in an iteration.

Further, in practice, because the number of simulations in a single iteration is limited and the DQN algorithm prefers high-scoring solutions, no solution may be produced for large numbers of fast departure time periods. For example, in fig. 11 the algorithm gives no solution when the number of fast departure time periods is greater than 28. This is because, when the number of fast departure time periods is already large, adding more of them cannot shorten the total waiting time significantly, yet each extra fast departure still incurs an extra penalty. Therefore, during iteration the most recent earlier "reasonable value" must be used: if the i-th iteration has no solution for a certain number of fast departure time periods, the solution for that number from the (i-1)-th iteration is substituted; if the (i-1)-th iteration has none either, the search continues backward through the (i-2)-th iteration, and so on, until the first iteration is reached.
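
The backward search for the latest "reasonable value" can be sketched as follows. The data layout is an assumption made for illustration: `history[i - 1]` is taken to map each number x of fast departure time periods to the best total waiting time recorded in iteration i, with missing keys meaning that no solution was found. All numbers are hypothetical.

```python
from typing import Dict, List, Optional


def latest_reasonable_value(history: List[Dict[int, float]],
                            iteration: int, x: int) -> Optional[float]:
    """Best waiting time for x from `iteration`, else from earlier rounds.

    Iterations are numbered from 1. Returns None when no round up to and
    including `iteration` produced a solution with x fast periods.
    """
    for i in range(iteration, 0, -1):
        value = history[i - 1].get(x)
        if value is not None:
            return value
    return None


history = [
    {28: 900.0, 30: 880.0},  # iteration 1
    {28: 890.0},             # iteration 2: no solution for x = 30
    {},                      # iteration 3: no solution for either
]
print(latest_reasonable_value(history, 3, 30))
print(latest_reasonable_value(history, 3, 28))
```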

Based on the above analysis, as shown in fig. 4, the departure scheduling method of the present embodiment includes the following steps:

S401: Initializing a current value network, a target value network and hyper-parameters, and dividing the operation time of the trains of one line into a preset number of departure time periods.

The hyper-parameters include the number of iteration rounds, the number of simulations per iteration round (epoch), the memory bank capacity, the discount rate γ, the initial value of the exploration rate, and so on. In this embodiment, the number of iteration rounds is 3, the number of simulations per iteration round is epoch = 2250, and the initial value of the exploration rate is ε_1 = 1.

In this embodiment, the operation time of the trains of the one line is divided into departure time periods of 30 min each. For example, assuming that the daily operation time is from 6:00 to 24:00, the operation time is divided into 36 departure time periods.

S402: obtaining historical one-day passenger travel data and the capacity of the train on one line, wherein the passenger travel data comprise the inbound time, the inbound station ID, the outbound station ID and the transit station ID of each passenger.

The transit station ID is determined by a shortest path algorithm (such as the Dijkstra algorithm) according to the inbound station and the outbound station: in the subway line map, each station is taken as a node, each track section between stations as an edge, and the length of the track section as the weight of the edge; the inbound station is then taken as the start point and the outbound station as the end point, and the shortest path between them is calculated by the shortest path algorithm.
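
A compact sketch of the shortest path computation with Dijkstra's algorithm is given below. The four-station network, its station names and its edge weights are invented for illustration, and identifying which stations on the returned path are transit stations would additionally require line membership data not modelled here.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # station -> {neighbour: edge weight}


def shortest_path(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm; returns (total weight, list of stations)."""
    dist: Dict[str, float] = {start: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist[u]:
            continue  # stale heap entry, a shorter route was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return float("inf"), []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[goal], path


network: Graph = {
    "A": {"B": 3.0, "D": 2.0},
    "B": {"A": 3.0, "C": 4.0},
    "C": {"B": 4.0, "D": 8.0},
    "D": {"A": 2.0, "C": 8.0},
}
print(shortest_path(network, "A", "C"))
```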

After the passenger travel data is obtained, a preliminary screening can be performed first: the passenger travel data whose inbound station ID, outbound station ID or transit station ID matches a station ID on the one line is selected, i.e., the passenger travel data related to the one line is selected first.

In this embodiment, the travel data of passengers on a certain historical day is acquired, and then the data is used to perform passenger flow simulation for the operation time of a complete day, that is, one simulation is considered to be performed.

S403: Performing passenger flow simulation for a preset total number of simulations according to the passenger travel data and the capacity of the trains of the one line, acquiring sample data during each simulation and storing the sample data in a memory bank, and obtaining, after each simulation is finished, the number of fast departure time periods, the total waiting time and the current value network corresponding to the current simulation.

In this embodiment, the total number of simulations = number of iterations × number of simulations per iteration =3 × 2250, that is, three iterations are performed, and 2250 simulations are performed per iteration.
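
The relation between the simulation counter and the iteration rounds can be sketched as below, using the values of this embodiment. The helper names are hypothetical, and simulations are assumed to be numbered from 1.

```python
ITERATION_ROUNDS = 3
SIMS_PER_ROUND = 2250
TOTAL_SIMULATIONS = ITERATION_ROUNDS * SIMS_PER_ROUND


def iteration_round(i: int, sims_per_round: int = SIMS_PER_ROUND) -> int:
    """Iteration round j (from 1) that simulation i (from 1) belongs to."""
    if i < 1:
        raise ValueError("simulations are numbered from 1")
    return (i - 1) // sims_per_round + 1


def is_round_boundary(i: int, sims_per_round: int = SIMS_PER_ROUND) -> bool:
    """True when i is an integral multiple of the simulations per round,
    i.e., when the penalty term function is due to be updated."""
    return i % sims_per_round == 0


print(TOTAL_SIMULATIONS, iteration_round(2251), is_round_boundary(4500))
```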

Since the simulation is performed for one day, after each simulation is finished, the number of the fast departure time periods (i.e., the number of departure time periods during which departure is performed at the preset fast departure frequency in one day), the total waiting time (i.e., the total waiting time of all passengers on the route in one day), and the current value network (i.e., the latest current value network when the simulation is finished) corresponding to each simulation are obtained.

In this embodiment, the number of simulations is counted and is incremented by one after each simulation.

Specifically, as shown in fig. 5, the process of performing a passenger flow simulation includes the following steps:

S501: Initializing a simulation environment, and taking the first departure time period as the current departure time period.

S502: and according to the action of the current departure time period, passenger travel data, capacity and preset unit time, carrying out passenger flow simulation of the current departure time period, acquiring state data when the current departure time period is ended as the state data of the current departure time period, and meanwhile, counting the total waiting time of the current departure time period according to the number of the passengers waiting at each station on the line in each unit time in the current departure time period.

The action of the first departure time period is preset. There are two actions, departure at the preset fast departure frequency (hereinafter, fast departure) and departure at the preset slow departure frequency (hereinafter, slow departure), and each departure time period corresponds to one action. The state data includes the number of people in each station on the line, the positions of the trains that have departed, and the number of people on the trains that have departed.

S503: generating a random number, wherein the range of the random number is 0-1.

S504: and judging whether the random number is smaller than the exploration rate corresponding to the simulation, if so, executing step S505, and if not, executing step S506.

The exploration rate corresponding to the first simulation is the initial value of the exploration rate, ε_1 = 1, and the exploration rate of each subsequent simulation is determined from the exploration rate of the previous simulation and a preset minimum exploration rate. Specifically, the exploration rate update formula is ε_i = max(ε_{min}, ε_{i-1} - 0.0045), where ε_i is the exploration rate corresponding to the i-th simulation, ε_{i-1} is the exploration rate corresponding to the (i-1)-th simulation, and ε_{min} is the preset minimum exploration rate; in this embodiment, ε_{min} = 0.1.

S505: and randomly generating the action of the next departure time period of the current departure time period.

S506: and determining the action of the next departure time period of the current departure time period according to the latest current value network.

Specifically, the state data s of the current departure time period and an action a in the preset action set A are input into the current value network, which outputs the score Q(s, a) of that state data and action. After the scores of all actions in the action set have been obtained through the current value network that is the latest at that point of the simulation, the action with the maximum score is taken as the action of the next departure time period, i.e., argmax_{a∈A} Q(s, a, θ_{evel}), where θ_{evel} denotes the network parameters of the latest current value network.
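
Steps S503 to S506 together amount to an ε-greedy choice, sketched below. The function name is hypothetical, and the scores that the current value network would produce are passed in as a plain dictionary so that the example does not depend on any neural network library.

```python
import random
from typing import Dict, Optional


def choose_action(scores: Dict[int, float], exploration_rate: float,
                  rng: Optional[random.Random] = None) -> int:
    """Pick the action of the next departure time period.

    scores           : Q(s, a) for every action a in the action set A
    exploration_rate : probability of choosing a random action instead
    """
    rng = rng or random.Random()
    if rng.random() < exploration_rate:      # S503/S504: random number test
        return rng.choice(sorted(scores))    # S505: random action
    return max(scores, key=scores.get)       # S506: action with highest score


q_scores = {0: -5.0, 1: -3.0}  # illustrative scores for slow (0) and fast (1)
print(choose_action(q_scores, exploration_rate=0.0))
```

Because the random number lies in [0, 1), an exploration rate of 1 always yields a random action and an exploration rate of 0 always yields the highest-scoring action.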

S507: and according to the action of the next departure time period, the passenger travel data, the capacity and the preset unit time, carrying out passenger flow simulation of the next departure time period, acquiring state data when the next departure time period is ended as the state data of the next departure time period, and meanwhile, counting the total waiting time of the next departure time period according to the number of the waiting passengers at each station on the line in each unit time in the next departure time period.

S508: Calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and the penalty term function corresponding to the current iteration round, i.e., calculating the return value r of the tuple (s, a, s'), where s is the state data of the current departure time period, a is the action of the next departure time period, and s' is the state data of the next departure time period.

Specifically, the return value calculation formula is r = -C _t ^k -a(f _j (x)-f _j (x-1)), wherein r is a reported value, C _t ^k The total waiting time of the current departure time period; a represents the action of the next departure time period, and in this embodiment, a =1 if the action of the next departure time period is fast departure, and a =0 if the action of the next departure time period is slow departure.

$f_j(x)$ is the penalty term function corresponding to the jth iteration. Its initial value is $f_1(x) = x \cdot M_0$, where x is the number of fast departure time periods and $M_0$ is the preset penalty term for a single fast departure. Thereafter, the penalty term function for each iteration is updated from that of the previous iteration according to $f_{j+1}(x) = K_{new} \times \mathrm{Smooth}(C_{best,j}(x)) + K_{old} \times f_j(x)$, where Smooth() is a smoothing function; $C_{best,j}(x)$ is a function obtained by recording, during the jth iteration, the minimum total waiting time under the optimal departure model with x fast departure time periods; and $K_{new}$ and $K_{old}$ are preset adjustment parameters that control the update step size of $f_j(x)$ between iterations.

Therefore, in the first iteration the return value formula reduces to $r = -C_t^k - a \cdot M_0$; that is, each additional fast departure time period is expected to reduce the total waiting time by at least $M_0$ minutes.

Further, before this step, and indeed before the passenger flow simulations of the preset total number are carried out, one all-fast simulation (every departure time period of the day uses fast departure) and one all-slow simulation (every departure time period of the day uses slow departure) are performed to obtain the theoretical shortest waiting time and the theoretical longest waiting time. The difference between the theoretical longest and shortest waiting times is then divided by the total number of departure time periods in one day (36 in this embodiment) to obtain the penalty term for a single fast departure: $M_0$ = (theoretical longest waiting time - theoretical shortest waiting time) / total number of departure time periods in one day.
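As a worked illustration of the return value and of $M_0$, the sketch below evaluates $r = -C_t^k - a\,(f_j(x) - f_j(x-1))$ with the first-iteration penalty $f_1(x) = x \cdot M_0$. The function names and all numeric waiting times are made up for the example; treating x as the count of fast departure time periods including the action being scored is also an assumption.

```python
from typing import Callable


def single_fast_penalty(longest_wait: float, shortest_wait: float,
                        periods_per_day: int = 36) -> float:
    """M0 = (theoretical longest wait - theoretical shortest wait) / periods."""
    return (longest_wait - shortest_wait) / periods_per_day


def return_value(total_wait: float, action: int, x: int,
                 penalty: Callable[[int], float]) -> float:
    """r = -C_t^k - a * (f_j(x) - f_j(x - 1)); action is 1 (fast) or 0 (slow)."""
    return -total_wait - action * (penalty(x) - penalty(x - 1))


# Illustrative numbers only (minutes of passenger waiting time).
m0 = single_fast_penalty(longest_wait=720_000.0, shortest_wait=360_000.0)


def f1(x: int) -> float:
    """First-iteration penalty term function f_1(x) = x * M0."""
    return x * m0


r_fast = return_value(total_wait=15_000.0, action=1, x=5, penalty=f1)
r_slow = return_value(total_wait=15_000.0, action=0, x=5, penalty=f1)
```

Here a fast departure is charged exactly one $M_0$ on top of the waiting time, while a slow departure is charged nothing extra, which matches the first-iteration form $r = -C_t^k - a \cdot M_0$.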

S509: generating sample data (s, a, r, s') from the state data of the current departure time period, the action of the next departure time period, the return value, and the state data of the next departure time period, and storing the sample data in the memory bank.

Further, if the storage capacity of the memory bank reaches the preset memory bank capacity, that is, the memory bank is full, the sample data stored in the memory bank at the earliest is deleted.
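A fixed-capacity memory bank that discards its oldest sample when full can be sketched with a bounded deque. The class name `MemoryBank` and its methods are hypothetical; the patent does not prescribe a data structure.

```python
from collections import deque


class MemoryBank:
    """Fixed-capacity store of (s, a, r, s') samples; oldest is dropped first."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen evicts from the left when a new item is appended.
        self._samples = deque(maxlen=capacity)

    def store(self, s, a, r, s_next) -> None:
        self._samples.append((s, a, r, s_next))

    def samples(self) -> list:
        return list(self._samples)

    def __len__(self) -> int:
        return len(self._samples)


bank = MemoryBank(capacity=3)
for step in range(4):
    # Toy samples: state is just the step index.
    bank.store([step], step % 2, -float(step), [step + 1])
```

After four stores into a bank of capacity 3, the first sample has been evicted and the bank holds the samples from steps 1 to 3.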

S510: and judging whether the next departure time period is the last departure time period of the operation time, if so, indicating that passenger flow simulation of all departure time periods in one day is finished, executing step S512, and if not, executing step S511.

S511: and taking the next departure time period as the current departure time period, taking the state data of the next departure time period as the state data of the current departure time period, taking the total waiting time of the next departure time period as the total waiting time of the current departure time period, and then returning to execute the step S503.

S512: counting the number of departure time periods of which the actions in the simulation are departure at a quick departure frequency to obtain the number of the quick departure time periods corresponding to the simulation, and calculating the total waiting time corresponding to the simulation according to the total waiting time of each departure time period of the operation time in the simulation; and meanwhile, taking the latest current value network when the simulation is finished as the current value network corresponding to the simulation.

Assume that the total waiting time of the kth departure time period in this simulation is $C_t^k$, with $k \in [1, z]$, $k \in \mathbb{N}^*$, where z is the total number of departure time periods in one day (z = 36 in this embodiment). The total waiting times of the 36 departure time periods in this simulation are then accumulated to obtain the total waiting time corresponding to this simulation.

Further, as for steps S502 and S507, as shown in fig. 6, the process of performing the passenger flow simulation for one departure time period includes the following steps:

S601: taking the first unit time of the current departure time period as the current unit time.

In this embodiment, the unit time is 1min, and one departure time period is 30min, so that 30 unit times are included in one departure time period.

S602: determining, for each station on the line, whether a train arrives at that station in the current unit time, according to the action of the current departure time period and the preset train operation data. If so, that is, trains arrive at some stations and not at others, step S603 is executed, followed by step S604; if not, that is, no train arrives at any station on the line, step S604 is executed.

The specific departure times of the current departure time period can be determined from its action. For example, if the fast departure frequency is 4 min/time (one train every 4 minutes, i.e., a departure interval of 4 min), the slow departure frequency is 8 min/time (one train every 8 minutes, i.e., a departure interval of 8 min), and the action of the current departure time period is slow departure, then a train departs at the 0th, 8th, 16th, and 24th minutes of the current departure time period.

Further, in this embodiment, a variable $t_p$ records the time elapsed since the last departure, and whether to dispatch a train is determined by comparing $t_p$ with the departure interval of the current departure time period. For example, suppose the action of the previous departure time period was slow departure, with trains dispatched at its 0th, 8th, 16th, and 24th minutes. If the action of the current departure time period is slow departure, the first departure of the current period takes place at its 2nd minute. If the action of the current departure time period is fast departure, then $t_p$ = 6 min at the start of the period, which is greater than the current departure interval of 4 min, so a train can be dispatched immediately; that is, the first departure of the current period takes place at the start of the period.
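The $t_p$ bookkeeping can be sketched minute by minute. This is one possible reading, assuming a train is dispatched whenever $t_p$ is at least the current interval and that $t_p$ is carried over from one departure time period to the next; `departure_minutes` is a hypothetical helper.

```python
from typing import List, Tuple


def departure_minutes(interval: int, t_p: int,
                      period_len: int = 30) -> Tuple[List[int], int]:
    """Return the minutes of one departure time period at which a train is
    dispatched, plus the value of t_p carried into the next period.

    interval: departure interval of this period in minutes (4 fast, 8 slow).
    t_p: minutes elapsed since the last departure at the start of the period.
    """
    minutes = []
    for minute in range(period_len):
        if t_p >= interval:      # enough time has passed: dispatch now
            minutes.append(minute)
            t_p = 0
        t_p += 1                 # one unit time (1 min) elapses
    return minutes, t_p


# Previous period: slow departure, first train at minute 0.
prev_minutes, carry = departure_minutes(interval=8, t_p=8)
# Current period, either slow again or switched to fast.
slow_minutes, _ = departure_minutes(interval=8, t_p=carry)
fast_minutes, _ = departure_minutes(interval=4, t_p=carry)
```

The sketch reproduces the example in the text: the previous slow period dispatches at minutes 0, 8, 16, and 24 and leaves $t_p$ = 6; a following slow period first dispatches at minute 2, and a following fast period dispatches immediately at minute 0.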

Trains on the same line generally run in the same way, so the train operation data show how many minutes after dispatch a train arrives at each station; combined with the departure times obtained above, this determines which stations have a train arriving in each unit time.

S603: performing people flow interaction processing at each station where a train arrives, according to the passenger travel data and the train capacity; updating, according to the result of the people flow interaction processing, the number of people in each such station (excluding passengers who have alighted and are not transferring) and the number of people on each dispatched train; and counting the number of waiting passengers at each such station in the current unit time.

Specifically, if a train arrives at a station on the line in the current unit time, people flow interaction processing is performed at that station according to the passenger travel data and the train capacity. The people flow interaction processing comprises on-board passengers alighting and in-station passengers boarding. The alighting passengers are passengers who are on board (i.e., on a dispatched train on the line) and whose outbound station ID or transit station ID is the station ID of this station. The in-station passengers comprise inbound passengers (i.e., passengers who have just entered the station) and transfer passengers (i.e., passengers transferring to this line from other lines). The inbound passengers are passengers whose inbound time is earlier than the current unit time, whose inbound station ID is the station ID of this station, and who have not yet boarded. The transfer passengers are passengers whose transit station ID is the station ID of this station, who are transferring onto this line, whose time since arriving at the transfer station exceeds the preset transfer time, and who have not yet boarded.

In this embodiment, since the card-swiping gates in most subway stations are close to the boarding positions, the time passengers take to walk from the gates to the platform is not modeled in the simulation system; by default, passengers begin waiting as soon as they swipe in. However, because of the layout of subway lines, walking to the transfer platform after alighting at a transfer station often takes passengers a certain amount of time. To account for this, a transfer time $t_1$ is designed into the simulation: passengers in the transferring state cannot board the next train immediately and are not counted among the waiting passengers. In this embodiment, $t_1$ = 1 min; this parameter can be adjusted.

After the people flow interaction processing, the number of passengers on each dispatched train that arrives at a station in the current unit time, and the number of people in that station, are updated according to the number of passengers alighting and the number of in-station passengers boarding.
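The count update at a station with an arriving train can be sketched as below. The function `exchange` is hypothetical, and the ordering (alighting first, then boarding until the train capacity is reached) is an assumption; the text only states that the processing depends on the passenger travel data and the train capacity.

```python
from typing import Tuple


def exchange(onboard: int, alighting: int, eligible_waiting: int,
             capacity: int) -> Tuple[int, int]:
    """Return (passengers on the train, passengers still waiting) after one
    train stops at one station.

    onboard: passengers on the train when it arrives.
    alighting: on-board passengers whose outbound or transit station is here.
    eligible_waiting: inbound and transfer passengers allowed to board.
    capacity: maximum number of passengers the train can carry.
    """
    after_alighting = onboard - alighting
    boarding = min(eligible_waiting, capacity - after_alighting)
    return after_alighting + boarding, eligible_waiting - boarding


train_load, left_waiting = exchange(onboard=900, alighting=200,
                                    eligible_waiting=500, capacity=1000)
```

In this example 200 passengers alight, 300 of the 500 waiting passengers fit on the train, and the remaining 200 keep waiting for the next train.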

S604: and respectively updating the number of the passengers in each station without train arrival in the current unit time according to the passenger trip data, and respectively counting the number of the passengers waiting for each station without train arrival in the current unit time.

Specifically, if no train arrives at a station on the line within the current unit time, the number of passengers entering that station within the current unit time is determined according to the passenger travel data and the station ID of the station, the number of people in the station is updated accordingly, and the number of waiting passengers at the station in the current unit time is counted.

S605: and counting the total number of waiting passengers in the current unit time according to the number of waiting passengers at each station on the line in the current unit time.

S606: and judging whether the current unit time is the last unit time of the current departure time period, if so, indicating that the passenger flow simulation of the current departure time period is finished, executing step S607, otherwise, executing step S608.

S607: and calculating to obtain the total waiting time of the current departure time period according to the total number of waiting passengers in each unit time in the current departure time period and the duration of the unit time.

Specifically, the total waiting time $C_t^k$ of the kth departure time period is calculated as follows:

$C_t^k = \sum_{i=1}^{T_0/t_0} N_i \times t_0$

where $t_0$ is the duration of one unit time and $T_0$ is the duration of one departure time period (in this embodiment, $t_0$ = 1 min and $T_0$ = 30 min), and $N_i$ is the total number of waiting passengers in the ith unit time of the kth departure time period.
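As a small arithmetic check, the sketch below assumes the total waiting time of a departure time period is the sum of the per-unit-time waiting counts $N_i$ multiplied by the unit duration $t_0$, which is the reading suggested by step S607 and the symbol definitions above. The function name and the sample counts are illustrative.

```python
from typing import Sequence


def total_waiting_time(waiting_counts: Sequence[int], t0: float = 1.0) -> float:
    """Sum of N_i over the unit times of one period, multiplied by t0 (min)."""
    return t0 * sum(waiting_counts)


# 30 unit times of 1 min each, with a constant 10 passengers waiting.
c_t_k = total_waiting_time([10] * 30, t0=1.0)
```

Ten passengers waiting throughout a 30-minute period contribute 300 passenger-minutes of waiting.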

S608: the next unit time is taken as the current unit time, and the process returns to step S602.

S404: when new sample data is stored in the memory base and the number of the sample data in the memory base reaches a preset number threshold value, randomly selecting a batch of sample data from the memory base according to a preset batch size, and training to obtain the latest current value network according to the batch of sample data.

Specifically, as shown in fig. 7, the present step includes the steps of:

S701: when new sample data is stored in the memory bank and the number of sample data in the memory bank reaches the preset number threshold (for example, 360), randomly selecting sample data of the preset batch size from the memory bank as the current batch of sample data, and taking the latest current value network as the current value network to be trained.

In this embodiment, the preset batch size is 64, that is, 64 sample data are randomly selected as the current batch of sample data.
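The threshold check and random batch selection can be sketched as follows. `sample_batch` is a hypothetical helper, and sampling without replacement via `random.sample` is an assumption; the text only says the batch is selected randomly.

```python
import random
from typing import List, Optional


def sample_batch(samples: List[tuple], threshold: int = 360,
                 batch_size: int = 64, rng=random) -> Optional[List[tuple]]:
    """Return a random batch once enough samples exist, otherwise None."""
    if len(samples) < threshold:
        return None  # too few samples: training is not triggered yet
    return rng.sample(samples, batch_size)


# Toy memory contents: each sample is (state, action, return, next_state).
memory = [([i], i % 2, -float(i), [i + 1]) for i in range(400)]
batch = sample_batch(memory, rng=random.Random(0))
too_early = sample_batch(memory[:100])
```

Passing a seeded `random.Random` instance makes the selection reproducible in experiments, while the default uses the module-level generator.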

S702: and acquiring the p-th sample data from the current batch of sample data as the current sample data, wherein the initial value of p is 1.

S703: calculating, through the latest current value network to be trained, the score $Q_{evel}(s, a)$ corresponding to the state data of the current departure time period and the action of the next departure time period in the current sample data, as the first score corresponding to the current sample data.

S704: calculating, through the latest target value network, the score of each action for the state data of the next departure time period in the current sample data, and taking the maximum of these scores, $\max_{a' \in A} Q(s', a')$, as the second score corresponding to the current sample data.

Where A is a predetermined set of actions, and in this embodiment, A is {0,1}.

S705: calculating a loss value according to a return value in the current sample data, a first score and a second score corresponding to the current sample data and a preset discount rate, and updating the latest network parameters of the current value network to be trained through a back propagation algorithm according to the loss value.

Specifically, the loss value is calculated according to a loss function, which is:

$Loss = (Q_{target}(s, a) - Q_{evel}(s, a))^2$, $Q_{target}(s, a) = r + \gamma \times \max_{a' \in A} Q(s', a')$

where Loss is the loss value, $Q_{evel}(s, a)$ is the first score corresponding to the current sample data, r is the return value in the current sample data, $\gamma$ is the preset discount rate, and $\max_{a' \in A} Q(s', a')$ is the second score corresponding to the current sample data.

And after the loss value is obtained through calculation, updating the latest network parameters of the current value network to be trained through a back propagation algorithm according to the loss value, wherein the updated current value network to be trained is the latest current value network.
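The loss for a single sample can be checked numerically with plain floats. `td_loss` is a hypothetical helper that only evaluates the squared difference between $Q_{target}$ and $Q_{evel}$; the back propagation step itself is not shown, and the discount rate and scores are made-up values.

```python
from typing import Sequence


def td_loss(first_score: float, reward: float,
            next_scores: Sequence[float], gamma: float) -> float:
    """Loss = (Q_target - Q_evel)^2 with Q_target = r + gamma * max Q(s', a').

    first_score: Q_evel(s, a) from the current value network.
    next_scores: Q(s', a') for every action a', from the target value network.
    """
    q_target = reward + gamma * max(next_scores)
    return (q_target - first_score) ** 2


# Q_target = -20 + 0.5 * (-80) = -60, so Loss = (-60 - (-100))^2.
loss = td_loss(first_score=-100.0, reward=-20.0,
               next_scores=[-90.0, -80.0], gamma=0.5)
```

The second score is the larger of the two next-state scores (-80), giving a target of -60 and a loss of 1600.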

S706: judging whether the current batch of sample data has been traversed, that is, whether p equals the preset batch size; if so, executing step S708; otherwise, executing step S707.

S707: and acquiring next sample data from the current batch of sample data as the current sample data, namely making p = p +1, and returning to execute the step S702.

S708: and taking the latest current value network to be trained as the latest current value network.

That is, training runs in parallel with the passenger flow simulation: each time a batch of sample data has been trained on, a new latest current value network is obtained, and that network is used in the subsequent passenger flow simulation.

S405: and when the simulation times reach integral multiples of the preset first times, updating the network parameters of the target value network according to the latest network parameters of the current value network.

In this embodiment, the preset first number is 720; that is, each time 720 passenger flow simulations have been performed, the network parameters of the target value network are replaced by those of the current value network. The replacement allows the neural network to learn further. Until the next replacement, the network parameters θ' of the target value network remain fixed, and only the network parameters θ of the current value network change through the training of step S404.
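The periodic replacement of the target value network parameters can be sketched with plain dictionaries standing in for network parameters. `maybe_sync_target` is a hypothetical helper; a real network would copy its weight tensors instead.

```python
from typing import Dict


def maybe_sync_target(simulation_count: int,
                      current_params: Dict[str, float],
                      target_params: Dict[str, float],
                      sync_every: int = 720) -> Dict[str, float]:
    """Return the target parameters to use after `simulation_count` simulations.

    On every integer multiple of `sync_every`, the target parameters are
    replaced by a copy of the current parameters; otherwise they stay fixed.
    """
    if simulation_count > 0 and simulation_count % sync_every == 0:
        return dict(current_params)  # copy so later training does not leak in
    return target_params


theta = {"w": 1.0}        # parameters of the current value network
theta_prime = {"w": 0.0}  # parameters of the target value network
unchanged = maybe_sync_target(719, theta, theta_prime)
replaced = maybe_sync_target(720, theta, theta_prime)
```

Copying rather than aliasing keeps θ' fixed between replacements even though θ keeps changing during training.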

S406: when the number of simulations reaches the preset total number of simulations, determining the optimal current value network corresponding to each number of fast departure time periods according to the number of fast departure time periods, the total waiting time, and the current value network corresponding to each simulation.

Specifically, the total waiting times of all simulations with the same number of fast departure time periods are compared, and the current value network of the simulation with the least total waiting time is taken as the optimal current value network for that number of fast departure time periods. That is, each number of fast departure time periods has a corresponding optimal current value network; in this embodiment there are 36 possible numbers of fast departure time periods and therefore 36 optimal current value networks.
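The per-count selection can be sketched as a single pass over the simulation records. `best_networks` is a hypothetical helper, and the strings in `records` are placeholders for saved current value networks.

```python
from typing import Any, Dict, Iterable, Tuple


def best_networks(
    records: Iterable[Tuple[int, float, Any]]
) -> Dict[int, Tuple[float, Any]]:
    """Map each number of fast departure time periods to the
    (total waiting time, network) of its best simulation.

    records: one (fast_count, total_wait, network) tuple per simulation.
    """
    best: Dict[int, Tuple[float, Any]] = {}
    for fast_count, total_wait, network in records:
        if fast_count not in best or total_wait < best[fast_count][0]:
            best[fast_count] = (total_wait, network)
    return best


records = [(8, 500.0, "net_a"), (8, 450.0, "net_b"), (9, 470.0, "net_c")]
best = best_networks(records)
```

For the two simulations with 8 fast departure time periods, the one with the smaller total waiting time (`net_b`) is retained.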

Further, each iteration is carried out, the optimal current value network corresponding to each fast departure time period number in the current iteration is determined according to the fast departure time period number, the total waiting time and the current value network corresponding to each simulation in the current iteration, and then the optimal current value network is compared with the optimal solutions of the previous iterations, so that the optimal solution in the iterated round is determined.

Fig. 8-10 show how the DQN algorithm searches for the optimal solution for different numbers of fast departure time periods during learning: fig. 8 is a schematic diagram of the training results of the reinforcement learning model for 1-6 fast departure time periods, fig. 9 for 7-14, and fig. 10 for 15-24. In fig. 8-10, the abscissa is the number of simulations and the ordinate is the negative of the total waiting time. Because the number of simulations in a single iteration is limited and because of the preferences of the DQN algorithm, in the illustrated experiment the DQN algorithm does not attempt strategies with too many fast departure time periods: once the number of fast departure time periods is already large, adding more yields little further benefit. After training, the schematic diagram of optimal solutions for different numbers of fast departure time periods shown in fig. 11 is obtained. In fig. 11, the abscissa is the number of fast departure time periods and the ordinate is the negative of the total waiting time; each point on the thickened black line is the optimal model, under the current conditions, for the number of fast departure time periods given by its abscissa. The black star at abscissa 8 is the result of departure according to the official schedule.

S407: and selecting an optimal current value network corresponding to the number of the fast departure time periods according to the requirement.

S408: and acquiring the state data of the departure time period of the line, and determining the action of the next departure time period of the departure time period through the selected optimal current value network according to the state data of the departure time period.

Specifically, an optimal current value network is selected according to actual requirements. For example, if the line to be analyzed is required to have 8 fast departure time periods within its operation time, the optimal current value network corresponding to x = 8 fast departure time periods is selected. The network parameters of the selected optimal current value network are then loaded into the target value network. Next, the state data of the line at the end of the current departure time period is obtained from actual conditions; the state data and each action in the action set are passed to the loaded target value network to obtain a score for each action; and the action with the maximum score is taken as the departure suggestion for the next departure time period.

That is, for a line to be analyzed, the number of fast departure time periods is first determined according to the actual requirements of the subway operator, and the optimal current value network is selected accordingly. The state data of the first departure time period of the day on the line is then obtained, and the action of the second departure time period is determined through the selected network; when the second departure time period ends, its state data is obtained and the action of the third departure time period is determined in the same way, and so on.
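The rolling, period-by-period use of the selected network can be sketched as a loop. `plan_day`, `observe_state`, and `toy_q` are hypothetical: in deployment the state would come from live station and train data, and the scorer would be the selected optimal current value network.

```python
from typing import Callable, List, Sequence


def plan_day(q_value: Callable[[Sequence[float], int], float],
             observe_state: Callable[[int], Sequence[float]],
             first_action: int, periods: int = 36,
             actions: Sequence[int] = (0, 1)) -> List[int]:
    """Return one action per departure time period of the day.

    The action of period 1 is given; after each period k ends, its state
    data is observed and the best-scoring action is chosen for period k + 1.
    """
    schedule = [first_action]
    for k in range(1, periods):
        state = observe_state(k)  # state data at the end of period k
        schedule.append(max(actions, key=lambda a: q_value(state, a)))
    return schedule


def toy_q(state: Sequence[float], action: int) -> float:
    """Hypothetical scorer favoring fast departure (1) when crowds are large."""
    crowd = sum(state)
    return -crowd if action == 0 else -0.5 * crowd - 100.0


crowds = {1: 300, 2: 100, 3: 300}  # made-up waiting counts per period
schedule = plan_day(toy_q, lambda k: [crowds[k]], first_action=0, periods=4)
```

Each decision uses only the state observed at the end of the preceding period, which is what allows the departure mode to be adjusted during the day.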

Existing methods (such as genetic algorithms) can only compute the optimal solution for the previous day from that day's subway passenger flow, and then apply the result to the current day on the assumption that the current day's passenger flow is roughly the same as the previous day's. Such traditional algorithms therefore ignore many features of the current day's passenger flow and cannot adjust dynamically in time. During training, the DQN algorithm presents the agent with the environment state at the current moment, and the agent selects the departure action for the next time period. This means that once the algorithm is deployed in practice, the departure decision for the next time period can be given in real time from nothing more than the current state of the trains and stations.

In summary, the present embodiment has the following advantages:

1. The departure mode can be dynamically adjusted in real time according to the current subway passenger flow, effectively relieving pressure on rail transit.

2. The method adapts well to unexpected situations such as abnormal passenger flow. In the morning and evening abnormal passenger flow tests, the scheme of this embodiment reduces passenger waiting time by 43.37% (751723 min) and 21.96% (319761 min), respectively, relative to the static schedule.

3. A static schedule with 8 fast departure time periods was generated for comparison with the official schedule; it reduces the total waiting time by about 2% (about 12663 min).

4. Compared with some traditional algorithms, such as the genetic algorithm, the static schedules generated on several sub-datasets nearly reach the optimal solution on small test samples that can be exhaustively enumerated, with only a slight increase in training time. In addition, once training is finished, the resulting model only needs to perform inference, which generally completes in about 1 minute and can meet the dispatching needs of an actual subway.

5. Different from the traditional static method, the method explores a larger solution space, and can generate optimal models with different fast departure quantities along with the learning process of the intelligent agent in the training process, and subway workers can select the appropriate models to deploy according to the actual conditions.

Example two

Referring to fig. 2, a second embodiment of the present invention is a departure scheduling device that can execute the departure scheduling method provided by the embodiments of the invention and has the functional modules and beneficial effects corresponding to that method. The device can be implemented in software and/or hardware, and specifically comprises:

an obtaining module 202, configured to obtain passenger travel data of a historical day and a capacity of a train on the one route, where the passenger travel data includes inbound time, inbound station ID, outbound station ID, and transit station ID of each passenger, and the transit station ID is determined according to a shortest path algorithm between an inbound station and an outbound station;

the simulation module 203 is used for performing passenger flow simulations of a preset total number according to the passenger travel data and the capacity of the trains on the line, acquiring sample data during each simulation and storing the sample data in a memory bank, and, after each simulation ends, counting the number of fast departure time periods and the total waiting time corresponding to that simulation and determining the current value network corresponding to that simulation, wherein one simulation is a passenger flow simulation of one day's operation time; each sample data comprises the state data of a departure time period, the action and state data of the next departure time period, and a return value; the state data comprises the number of people in each station on the line, the positions of the dispatched trains, and the number of dispatched trains; the action is departure at a preset fast departure frequency or departure at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

In an alternative embodiment, the simulation module 203 includes:

the first presetting unit is used for presetting the action of a first departure time period of the operation time in the ith simulation, taking the first departure time period as the current departure time period, and setting the initial value of i as 1;

the first simulation unit is used for simulating passenger flow in the current departure time period according to the action, passenger travel data, capacity and preset unit time of the current departure time period, acquiring state data when the current departure time period is ended as the state data of the current departure time period, and meanwhile, counting the total waiting time of the current departure time period according to the number of waiting passengers at each station on the line in each unit time in the current departure time period;

a first generating unit, configured to generate a random number, where the range of the random number is 0-1;

the second generation unit is used for randomly generating the action of the next departure time period of the current departure time period if the random number is smaller than the exploration rate corresponding to the ith simulation;

the first determining unit is used for determining the action of the next departure time period of the current departure time period according to the latest current value network if the random number is greater than or equal to the exploration rate corresponding to the ith simulation;

the second simulation unit is used for simulating passenger flow of the next departure time period according to the action of the next departure time period, passenger travel data, capacity and preset unit time, acquiring state data when the next departure time period is finished as the state data of the next departure time period, and meanwhile, counting the total waiting time of the next departure time period according to the number of waiting passengers at each station on the line in each unit time in the next departure time period;

the first calculation unit is used for calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and the penalty term function corresponding to the jth iteration, where $j = \lceil i / epoch \rceil$ and epoch is the preset number of simulations per iteration;

a third generating unit, configured to generate one sample data according to the state data of the current departure time period, the action of the next departure time period, the return value, and the state data of the next departure time period, and store the sample data in the memory bank;

the first judgment unit is used for judging whether the next departure time period is the last departure time period of the operation time or not to obtain a first judgment result;

the first execution unit is used for taking the next departure time period as the current departure time period and returning to execute the first generation unit if the first judgment result is negative;

a first obtaining unit, configured to, if the first judgment result is yes, count the number of departure time periods in the ith simulation whose action is departure at the fast departure frequency to obtain the number of fast departure time periods corresponding to the ith simulation, calculate the total waiting time corresponding to the ith simulation from the total waiting time of each departure time period of the operation time in the ith simulation, and take the latest current value network at that time as the current value network corresponding to the ith simulation;

the second judgment unit is used for judging whether i is equal to the preset total simulation times or not to obtain a second judgment result;

and the second determining unit is used for, if the second judgment result is negative, determining the exploration rate corresponding to the (i + 1)th simulation according to the exploration rate corresponding to the ith simulation and the preset minimum exploration rate, setting i = i + 1, and returning to execute the first presetting unit, wherein the exploration rate corresponding to the first simulation is a preset initial exploration rate.
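The exploration logic handled by the units above can be sketched as follows. All three helpers are hypothetical. The iteration index follows $j = \lceil i / epoch \rceil$ and the random-versus-network choice follows the unit descriptions; the multiplicative decay in `next_exploration_rate` is purely an assumption, since the text here only says the next rate depends on the current rate and the preset minimum.

```python
import random
from typing import Callable, Sequence


def iteration_index(i: int, epoch: int) -> int:
    """j = ceil(i / epoch), computed with integer arithmetic (i starts at 1)."""
    return (i + epoch - 1) // epoch


def choose_action(exploration_rate: float, network_action: Callable[[], int],
                  actions: Sequence[int] = (0, 1), rng=random) -> int:
    """Random action if the random number in [0, 1) is below the exploration
    rate; otherwise the action chosen by the latest current value network."""
    if rng.random() < exploration_rate:
        return rng.choice(actions)
    return network_action()


def next_exploration_rate(rate: float, minimum: float,
                          decay: float = 0.99) -> float:
    """Assumed schedule: shrink by `decay`, never below the preset minimum."""
    return max(minimum, rate * decay)


j_first = iteration_index(100, epoch=100)
j_second = iteration_index(101, epoch=100)
exploit = choose_action(0.0, network_action=lambda: 1)
explore = choose_action(1.0, network_action=lambda: 1, rng=random.Random(0))
```

With an exploration rate of 0 the network's action is always used, and with a rate of 1 the action is always drawn at random from the action set.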

In an optional embodiment, the first simulation unit includes:

the first as subunit is used for taking the first unit time of the current departure time period as the current unit time;

the first judging subunit is used for respectively judging whether a train arrives at each station on the line in the current unit time according to the action of the current departure time period and preset train operation data;

the interactive processing subunit is used for, if a train arrives at a station, performing people flow interaction processing at the station according to the passenger travel data and the train capacity, wherein the people flow interaction processing comprises on-board passengers alighting and in-station passengers boarding; the alighting passengers are passengers who are on board and whose outbound station ID or transit station ID is the station ID of the station; the in-station passengers comprise inbound passengers and transfer passengers; the inbound passengers are passengers whose inbound time is earlier than the current unit time, whose inbound station ID is the station ID of the station, and who have not boarded; and the transfer passengers are passengers whose transit station ID is the station ID of the station, whose time since arriving at the station exceeds the preset transfer time, whose outbound station ID corresponds to a station on the line, and who have not boarded;

the first updating subunit is used for updating the number of people in the station of the station and the number of people in the train which is sent out according to the people flow interaction processing result, and counting the number of people waiting for the train in the current unit time;

the second updating subunit is used for updating the number of people in the station of the station according to the passenger travel data and the station ID of the station if no train arrives at the station, and counting the number of people waiting for the train at the station in the current unit time;

the counting subunit is used for counting the total number of waiting passengers in the current unit time according to the number of waiting passengers at each station on the line in the current unit time;

the second judgment subunit is used for judging whether the current unit time is the last unit time of the current departure time period or not;

and the second designating subunit is used for taking the next unit time as the current unit time and returning to execute the first judging subunit if the judgment result of the second judging subunit is negative.
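As an illustration of the per-station interaction handled by the subunits above, the following is a minimal sketch that works on aggregated head counts rather than per-passenger records. The function name `station_step` and its parameters are hypothetical, and the filtering of passengers by inbound time, station ID and transfer time is assumed to have been done before the call.

```python
from typing import Tuple

def station_step(waiting: int, onboard: int, alighting: int,
                 capacity: int) -> Tuple[int, int]:
    """One train arrival at one station within one unit time (hypothetical helper).

    waiting   -- passengers in the station who are eligible to board
    onboard   -- passengers on the arriving train
    alighting -- onboard passengers whose outbound or transit station is this station
    capacity  -- train capacity
    Returns (passengers still waiting, passengers on the train after boarding).
    """
    onboard -= alighting                          # passengers get off first
    boarding = min(waiting, capacity - onboard)   # board until the train is full
    return waiting - boarding, onboard + boarding
```

Summing the first returned value over all stations on the line would give the total number of waiting passengers for the current unit time.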

In an optional embodiment, the first determining unit is specifically configured to, if the random number is greater than or equal to the exploration rate corresponding to the i-th simulation, calculate a score for each action in a preset action set through the latest current value network according to the state data of the current departure time period, and use the action with the maximum score as the action of the next departure time period, where the action set includes departing at a preset fast departure frequency and departing at a preset slow departure frequency.
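The selection rule can be sketched as an epsilon-greedy choice. This is only an illustration: the action encodings `SLOW` and `FAST`, the function name `choose_action`, and the `q_values` callable standing in for the current value network are all assumptions.

```python
import random
from typing import Callable, Sequence

SLOW, FAST = 0, 1  # hypothetical encodings of the two actions in the action set

def choose_action(state: Sequence[float],
                  q_values: Callable[[Sequence[float]], Sequence[float]],
                  epsilon: float,
                  rng: random.Random) -> int:
    """Return the action for the next departure time period."""
    if rng.random() < epsilon:
        # Random number below the exploration rate: pick a random action.
        return rng.choice([SLOW, FAST])
    # Otherwise score every action with the value network and take the best one.
    scores = q_values(state)  # one score per action, indexed by action id
    return max(range(len(scores)), key=lambda a: scores[a])
```

With `epsilon = 0` the choice is always greedy, and with `epsilon = 1` it is always random.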

In an optional embodiment, the first calculating unit is specifically configured to calculate the return value according to a return value calculation formula, where the return value calculation formula is $r = -C_t^k - a\,(f_j(x) - f_j(x-1))$, wherein $r$ is the return value; $C_t^k$ is the total waiting time of the current departure time period; $a = 1$ if the action of the next departure time period is to depart at the preset fast departure frequency, and $a = 0$ if the action of the next departure time period is to depart at the preset slow departure frequency; and $f_j(x)$ is the penalty term function corresponding to the j-th iteration.
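The return value formula above can be written out as a short function. The name `return_value` and the representation of the penalty term function as a Python callable are assumptions, and `x` is taken to be the number of fast departure time periods.

```python
from typing import Callable

def return_value(total_wait: float, fast: bool, x: int,
                 penalty: Callable[[int], float]) -> float:
    """r = -C - a * (f_j(x) - f_j(x - 1)).

    total_wait -- total waiting time C of the current departure time period
    fast       -- True if the next action departs at the fast frequency (a = 1)
    x          -- number of fast departure time periods (assumed meaning of x)
    penalty    -- penalty term function f_j of the current iteration
    """
    a = 1 if fast else 0
    return -total_wait - a * (penalty(x) - penalty(x - 1))
```

For a linear penalty such as `lambda x: 5.0 * x`, a fast departure costs an extra 5.0 on top of the waiting time, while a slow departure is charged only the waiting time.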

In an optional embodiment, the simulation module 203 further includes:

the third judging unit is used for judging whether i is equal to the integral multiple of the simulation times of each iteration to obtain a third judging result;

a third determining unit, configured to determine, if the third judgment result is yes, the penalty term function corresponding to the (j+1)-th iteration according to a penalty term function update formula and the penalty term function corresponding to the j-th iteration, where the penalty term function update formula is $f_{j+1}(x) = K_{new} \times \mathrm{Smooth}(C_{best,j}(x)) + K_{old} \times f_j(x)$, wherein $f_{j+1}(x)$ is the penalty term function corresponding to the (j+1)-th iteration, $f_j(x)$ is the penalty term function corresponding to the j-th iteration, $\mathrm{Smooth}(\cdot)$ is a smoothing function, $C_{best,j}(x)$ is the minimum total waiting time corresponding to the number $x$ of fast departure time periods in the j-th iteration, $K_{new}$ and $K_{old}$ are preset adjustment parameters, $f_1(x) = x \cdot M_0$, $x$ is the number of fast departure time periods, and $M_0$ is a preset penalty item for a single fast departure.
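A minimal sketch of the penalty term update follows, with the functions stored as lists indexed by the number x of fast departure time periods. The text does not specify Smooth(), so a centered moving average is used purely as a stand-in. All function names are hypothetical, and the sketch assumes a best waiting time is available for every x.

```python
from typing import List

def smooth(values: List[float], window: int = 3) -> List[float]:
    """Centered moving average; a stand-in, since Smooth() is not specified."""
    half = window // 2
    out = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def initial_penalty(n_periods: int, m0: float) -> List[float]:
    """f_1(x) = x * M0 for x = 0 .. n_periods."""
    return [x * m0 for x in range(n_periods + 1)]

def update_penalty(f_j: List[float], c_best_j: List[float],
                   k_new: float, k_old: float) -> List[float]:
    """f_{j+1}(x) = K_new * Smooth(C_best_j)(x) + K_old * f_j(x)."""
    smoothed = smooth(c_best_j)
    return [k_new * s + k_old * f for s, f in zip(smoothed, f_j)]
```

The split between `k_new` and `k_old` controls how quickly the penalty follows the best waiting times observed in the latest iteration.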

In an optional embodiment, the departure scheduling apparatus further includes:

the slow departure simulation module is used for setting the actions of each departure time period of the operation time to be departure at a preset slow departure frequency, carrying out one-time slow departure simulation according to the passenger travel data and the train capacity of the one line, and counting the total waiting time of each departure time period in the slow departure simulation to obtain the theoretical longest waiting time;

the quick departure simulation module is used for setting the actions of each departure time period in the operation time to be departure at a preset quick departure frequency, carrying out one-time quick departure simulation according to the passenger trip data and the capacity of the train of the one line, and counting to obtain the theoretical shortest waiting time according to the total waiting time of each departure time period in the quick departure simulation;

and the calculation module is used for dividing the difference between the theoretical longest waiting time and the theoretical shortest waiting time by the preset total number of departure time periods in one day to obtain the penalty item for a single fast departure.
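The calculation performed by the three modules above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the theoretical longest and shortest waiting times are obtained by summing the per-period totals of the all-slow and all-fast simulations.

```python
from typing import Sequence

def single_fast_departure_penalty(slow_waits: Sequence[float],
                                  fast_waits: Sequence[float],
                                  n_periods: int) -> float:
    """M0 = (theoretical longest wait - theoretical shortest wait) / n_periods.

    slow_waits -- total waiting time of each period in the all-slow simulation
    fast_waits -- total waiting time of each period in the all-fast simulation
    n_periods  -- preset total number of departure time periods in one day
    """
    longest = sum(slow_waits)    # assumed aggregation: sum over periods
    shortest = sum(fast_waits)
    return (longest - shortest) / n_periods
```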

In an optional embodiment, the second determining unit is specifically configured to determine the exploration rate corresponding to the (i+1)-th simulation according to an exploration rate update formula, where the exploration rate update formula is $\varepsilon_{i+1} = \max(\varepsilon_{min},\ \varepsilon_i - 0.0045)$, wherein $\varepsilon_{i+1}$ is the exploration rate corresponding to the (i+1)-th simulation, $\varepsilon_i$ is the exploration rate corresponding to the i-th simulation, and $\varepsilon_{min}$ is the preset minimum exploration rate.
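A sketch of an exploration rate schedule with a fixed step of 0.0045 and a floor at the minimum exploration rate is given below. The step is assumed here to reduce the rate, since the floor only takes effect for a decreasing schedule. The function names and the default values (initial value 1, minimum 0.1) are illustrative.

```python
from typing import List

def next_epsilon(eps: float, eps_min: float = 0.1, step: float = 0.0045) -> float:
    """Exploration rate for the next simulation, never below eps_min."""
    return max(eps_min, eps - step)

def schedule(n_simulations: int, eps0: float = 1.0,
             eps_min: float = 0.1, step: float = 0.0045) -> List[float]:
    """Exploration rates for simulations 1 .. n_simulations."""
    rates = [eps0]
    for _ in range(n_simulations - 1):
        rates.append(next_epsilon(rates[-1], eps_min, step))
    return rates
```

With these defaults the rate falls linearly and settles at the floor after roughly 200 simulations.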

In an optional embodiment, the simulation module 203 further includes:

and the deleting unit is used for deleting the sample data which is stored in the memory bank earliest if the memory bank is full.
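The memory bank behavior described here, where the earliest sample is dropped once the bank is full, matches a bounded first-in-first-out buffer. A minimal sketch using `collections.deque` follows; the class name `MemoryBank` and its method names are hypothetical.

```python
import random
from collections import deque
from typing import Any, List

class MemoryBank:
    """Bounded sample store; the oldest sample is discarded when full."""

    def __init__(self, capacity: int, threshold: int) -> None:
        self._samples = deque(maxlen=capacity)  # deque drops the oldest item itself
        self.threshold = threshold              # preset number threshold

    def store(self, sample: Any) -> None:
        self._samples.append(sample)

    def ready(self) -> bool:
        """True once enough samples are stored to start training."""
        return len(self._samples) >= self.threshold

    def sample_batch(self, batch_size: int, rng: random.Random) -> List[Any]:
        """Randomly select a batch without replacement."""
        return rng.sample(list(self._samples), batch_size)

    def __len__(self) -> int:
        return len(self._samples)
```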

In an optional embodiment, the training module 204 includes:

the selecting unit is used for randomly selecting sample data with a preset batch size from the memory base as sample data of the current batch when new sample data are stored in the memory base and the number of the sample data in the memory base reaches a preset number threshold value, and taking the latest current value network as a current value network to be trained;

the acquisition unit is used for traversing the sample data of the current batch and sequentially acquiring sample data from the sample data of the current batch;

the second calculating unit is used for calculating, through the latest current value network to be trained, the score corresponding to the state data of the current departure time period and the action of the next departure time period in the sample data, and taking the score as the first score corresponding to the sample data;

the third calculating unit is used for calculating, through the latest target value network, the score of each action for the state data of the next departure time period in the sample data, and taking the maximum score as the second score corresponding to the sample data;

the fourth calculating unit is used for calculating a loss value according to a return value in the sample data, the first score and the second score corresponding to the sample data and a preset discount rate, and updating the latest network parameter of the current value network to be trained according to the loss value;

and the unit is used for taking the latest current value network to be trained as the latest current value network after the current batch of sample data is traversed.

In an optional embodiment, the departure scheduling apparatus further includes:

and the updating module is used for updating the network parameters of the target value network according to the latest network parameters of the current value network when the simulation times reach the integral multiple of the preset first times.
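The periodic update of the target value network can be sketched as a copy of the current parameters every fixed number of simulations. The function name, the dictionary representation of network parameters, and the assumption that the simulation count starts at 1 are illustrative only.

```python
import copy
from typing import Any, Dict

def maybe_sync_target(sim_count: int, sync_every: int,
                      current_params: Dict[str, Any],
                      target_params: Dict[str, Any]) -> Dict[str, Any]:
    """Return the target parameters to use after simulation `sim_count`.

    The target network receives a copy of the current network's parameters
    whenever sim_count is an integer multiple of sync_every (the preset
    first number of times); otherwise it is left unchanged.
    """
    if sim_count % sync_every == 0:
        return copy.deepcopy(current_params)
    return target_params
```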

In an optional embodiment, the fourth calculating unit is specifically configured to calculate the loss value according to a loss function, where the loss function is $Loss = (Q_{target}(s,a) - Q_{evel}(s,a))^2$ with $Q_{target}(s,a) = r + \gamma \times \max_{a' \in A} Q(s',a')$, wherein $Loss$ is the loss value, $Q_{evel}(s,a)$ is the first score corresponding to the sample data, $r$ is the return value in the sample data, $\gamma$ is the preset discount rate, and $\max_{a' \in A} Q(s',a')$ is the second score corresponding to the sample data.
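For a single sample, the loss above reduces to a few lines of arithmetic. The sketch below uses plain floats in place of network outputs, and the function names are hypothetical.

```python
from typing import Sequence

def td_target(r: float, gamma: float, next_scores: Sequence[float]) -> float:
    """Q_target = r + gamma * max over actions of the target network's scores."""
    return r + gamma * max(next_scores)

def sample_loss(r: float, gamma: float, first_score: float,
                next_scores: Sequence[float]) -> float:
    """Squared difference between the target and the first score.

    first_score -- current value network score for (s, a)
    next_scores -- target value network scores for the next state, one per action
    """
    return (td_target(r, gamma, next_scores) - first_score) ** 2
```

In training, the gradient of this value with respect to the current value network's parameters would drive the update, while the target network's scores are treated as constants.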

In an optional embodiment, the first determining module 205 is specifically configured to, when the simulation times reach a preset total simulation times, compare total waiting times corresponding to simulations of the same number of fast departure time periods, and use a current value network corresponding to one simulation with the smallest total waiting time as an optimal current value network corresponding to the same number of fast departure time periods.
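The comparison described above amounts to grouping the simulations by their number of fast departure time periods and keeping the network with the smallest total waiting time in each group. In this sketch the record layout `(n_fast, total_wait, network)` and the function name are assumptions.

```python
from typing import Any, Dict, Iterable, Tuple

def select_best_networks(
        records: Iterable[Tuple[int, float, Any]]) -> Dict[int, Any]:
    """Map each number of fast departure periods to its best network.

    records -- one (n_fast, total_wait, network) tuple per finished simulation
    """
    best: Dict[int, Tuple[float, Any]] = {}
    for n_fast, total_wait, network in records:
        # Keep the network whose simulation had the smallest total waiting time.
        if n_fast not in best or total_wait < best[n_fast][0]:
            best[n_fast] = (total_wait, network)
    return {n: net for n, (_, net) in best.items()}
```

An operator could then look up the entry for the number of fast departure time periods the schedule can afford and deploy that network.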

Example three

Referring to fig. 3, a third embodiment of the present invention is: an electronic device, the electronic device comprising:

one or more processors 301;

a storage device 302 for storing one or more programs;

when the one or more programs are executed by the one or more processors 301, the one or more processors 301 implement the processes in the embodiment of the departure scheduling method described above, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.

Example four

A fourth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the departure scheduling method embodiment described above and achieves the same technical effect; to avoid repetition, details are not repeated here.

In summary, in the departure scheduling method, apparatus, device and storage medium provided by the invention, the DQN algorithm is trained to select the departure action of the next time period from the environmental state given to the agent at the current time. When the algorithm is deployed in actual operation, a departure decision for the next departure time period can therefore be given in real time from only the current states of the trains and stations, so that the departure mode can be dynamically adjusted in real time according to the current passenger flow, rail transit pressure is effectively reduced, and the method adapts well to emergencies such as abnormal passenger flow. Meanwhile, optimal models corresponding to different numbers of fast departure time periods are generated as the agent learns during training, and staff can select a suitable model to deploy according to actual conditions.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the foregoing apparatus, each unit and each module included in the apparatus are merely divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for the convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

The above description is only an embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent modifications made by the present invention and the contents of the accompanying drawings, which are directly or indirectly applied to the related technical fields, are included in the scope of the present invention.

Claims

1. A departure scheduling method, comprising:

obtaining passenger travel data of one historical day and the capacity of the train of the line, wherein the passenger travel data comprise the inbound time, the inbound station ID, the outbound station ID and the transit station ID of each passenger, and the transit station ID is determined according to the inbound station and the outbound station through a shortest path algorithm;

selecting an optimal current value network corresponding to the number of the time periods of quick departure according to the requirement;

acquiring state data of a departure time period of the line, and determining the action of the next departure time period of the departure time period through the selected optimal current value network according to the state data of the departure time period;

when new sample data is stored in a memory base and the number of the sample data in the memory base reaches a preset number threshold, randomly selecting a batch of sample data from the memory base according to a preset batch size, training a latest current value network according to the batch of sample data, and taking the trained current value network as the latest current value network, wherein the method comprises the following steps:

2. The departure scheduling method according to claim 1, wherein said performing passenger flow simulation for a preset total number of times of simulation according to said passenger travel data and the capacity of the train on said one route, acquiring sample data during each simulation, storing the sample data in a memory, counting the number of fast departure time periods and the total waiting time corresponding to the current simulation after each simulation is finished, and determining the current value network corresponding to the current simulation, comprises:

generating a random number, wherein the range of the random number is 0-1;

if the random number is smaller than the corresponding exploration rate of the ith simulation, randomly generating the action of the next departure time period of the current departure time period;

if the random number is larger than or equal to the exploration rate corresponding to the ith simulation, determining the action of the next departure time period of the current departure time period according to the latest current value network;

calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and a penalty term function corresponding to the j-th iteration, wherein $j = \lceil i / epoch \rceil$ and epoch is the preset number of simulations per iteration;

if not, taking the next departure time period as the current departure time period, and continuing to execute the step of generating the random number until the step of judging whether the next departure time period is the last departure time period of the operation time or not;

judging whether i is equal to a preset total simulation number;

and if not, determining the corresponding exploration rate of the (i + 1) th simulation according to the exploration rate corresponding to the ith simulation and a preset minimum exploration rate, wherein the exploration rate corresponding to the first simulation is a preset exploration rate initial value, enabling i = i +1, continuously executing the action of the first departure time period of the operation time in the preset ith simulation, and taking the first departure time period as the current departure time period.

3. The departure scheduling method according to claim 2, wherein the performing of the passenger flow simulation in the current departure time period according to the action of the current departure time period, the passenger travel data, the capacity amount and the preset unit time comprises:

4. The departure scheduling method according to claim 2, wherein said act of determining the next departure time period of the current departure time period from the most recent current value network comprises:

5. The departure scheduling method according to claim 2, wherein the calculating a return value according to the total waiting time of the current departure time period, the action of the next departure time period and the penalty term function corresponding to the jth iteration comprises:

calculating the return value according to a return value calculation formula, wherein the return value calculation formula is $r = -C_t^k - a\,(f_j(x) - f_j(x-1))$, wherein $r$ is the return value; $C_t^k$ is the total waiting time of the current departure time period; $a = 1$ if the action of the next departure time period is to depart at the preset fast departure frequency, and $a = 0$ if the action of the next departure time period is to depart at the preset slow departure frequency; and $f_j(x)$ is the penalty term function corresponding to the j-th iteration.

6. The departure scheduling method according to claim 5, further comprising, before setting i = i + 1:

if yes, determining the penalty term function corresponding to the (j+1)-th iteration according to a penalty term function update formula and the penalty term function corresponding to the j-th iteration, wherein the penalty term function update formula is $f_{j+1}(x) = K_{new} \times \mathrm{Smooth}(C_{best,j}(x)) + K_{old} \times f_j(x)$, wherein $f_{j+1}(x)$ is the penalty term function corresponding to the (j+1)-th iteration, $f_j(x)$ is the penalty term function corresponding to the j-th iteration, $\mathrm{Smooth}(\cdot)$ is a smoothing function, $C_{best,j}(x)$ is the minimum total waiting time corresponding to the number $x$ of fast departure time periods in the j-th iteration, $K_{new}$ and $K_{old}$ are preset adjustment parameters, $f_1(x) = x \cdot M_0$, $x$ is the number of fast departure time periods, and $M_0$ is a preset penalty item for a single fast departure.

7. The departure scheduling method according to claim 6, further comprising, after acquiring the passenger trip data of the historical day and the capacity of the train on the one route:

setting the actions of each departure time period in the operation time as departure at a preset quick departure frequency, carrying out one-time quick departure simulation according to the passenger trip data and the capacity of the train of the one line, and counting to obtain the theoretical shortest waiting time according to the total waiting time of each departure time period in the quick departure simulation;

and dividing the difference between the theoretical longest waiting time and the theoretical shortest waiting time by the preset total number of departure time periods in one day to obtain the penalty item for a single fast departure.

8. The departure scheduling method according to claim 2, wherein the determining the exploration rate corresponding to the (i+1)-th simulation according to the exploration rate corresponding to the i-th simulation and the preset minimum exploration rate comprises:

determining the exploration rate corresponding to the (i+1)-th simulation according to an exploration rate update formula, wherein the exploration rate update formula is $\varepsilon_{i+1} = \max(\varepsilon_{min},\ \varepsilon_i - 0.0045)$, wherein $\varepsilon_{i+1}$ is the exploration rate corresponding to the (i+1)-th simulation, $\varepsilon_i$ is the exploration rate corresponding to the i-th simulation, and $\varepsilon_{min}$ is the preset minimum exploration rate.

9. The departure scheduling method according to claim 8, wherein the preset initial exploration rate value is $\varepsilon_1 = 1$ and the preset minimum exploration rate is $\varepsilon_{min} = 0.1$.

10. The departure scheduling method according to claim 2, further comprising, before storing the sample data in a memory bank:

11. The departure scheduling method according to claim 1, further comprising:

and when the simulation times reach integral multiple of the preset first times, updating the network parameters of the target value network according to the latest network parameters of the current value network.

12. The departure scheduling method according to claim 1, wherein the calculating a loss value according to the return value in the sample data, the first score and the second score corresponding to the sample data, and the preset discount rate comprises:

calculating the loss value according to a loss function, wherein the loss function is $Loss = (Q_{target}(s,a) - Q_{evel}(s,a))^2$ with $Q_{target}(s,a) = r + \gamma \times \max_{a' \in A} Q(s',a')$, wherein $Loss$ is the loss value, $Q_{evel}(s,a)$ is the first score corresponding to the sample data, $r$ is the return value in the sample data, $\gamma$ is the preset discount rate, and $\max_{a' \in A} Q(s',a')$ is the second score corresponding to the sample data.

13. The departure scheduling method according to claim 1, wherein the determining the optimal current value network corresponding to each number of the fast departure time periods according to the number of the fast departure time periods, the total waiting time and the current value network corresponding to each simulation comprises:

and comparing the total waiting time corresponding to each simulation of the same quick departure time period quantity, and taking the current value network corresponding to the simulation with the least total waiting time as the optimal current value network corresponding to the same quick departure time period quantity.

14. A departure scheduling apparatus, comprising:

the acquisition module is used for acquiring passenger travel data of one historical day and the capacity of the trains of the one line, wherein the passenger travel data comprises the inbound time, inbound station ID, outbound station ID and transit station ID of each passenger, and the transit station ID is determined from the inbound station and the outbound station through a shortest path algorithm;

the simulation module is used for carrying out passenger flow simulation for a preset total number of simulations according to the passenger travel data and the capacity of the trains on the one line, acquiring sample data during each simulation, storing the sample data into a memory bank, counting, after each simulation is finished, the number of fast departure time periods and the total waiting time corresponding to that simulation, and determining the current value network corresponding to that simulation, wherein one simulation is a passenger flow simulation of the operation time of one day; each sample data comprises the state data of one departure time period, the action and the state data of the next departure time period of the one departure time period, and a return value; the state data comprises the number of passengers at each station on the one line, the positions of the dispatched trains and the number of passengers in the dispatched trains; the action is departing at a preset fast departure frequency or departing at a preset slow departure frequency; and the action of the next departure time period is determined according to the latest current value network;

the training module is used for randomly selecting a batch of sample data from the memory bank according to the size of a preset batch when new sample data is stored in the memory bank and the number of the sample data in the memory bank reaches a preset number threshold, training the latest current value network according to the batch of sample data, and taking the trained current value network as the latest current value network;

the second determining module is used for acquiring the state data of one departure time period of the one line, and determining the action of the next departure time period of the one departure time period through the selected optimal current value network according to the state data of the one departure time period;

the training module comprises:

15. An electronic device, characterized in that the electronic device comprises:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the departure scheduling method of any of claims 1-13.

16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the departure scheduling method according to any one of claims 1-13.