CN114240002A - Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning - Google Patents

Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning

Info

Publication number
CN114240002A
CN114240002A CN202210028133.4A
Authority
CN
China
Prior art keywords
bus
vehicle
station
departure
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210028133.4A
Other languages
Chinese (zh)
Inventor
伦嘉铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210028133.4A priority Critical patent/CN114240002A/en
Publication of CN114240002A publication Critical patent/CN114240002A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment
    • G06Q50/40

Abstract

The invention belongs to the technical field of intelligent bus dispatching systems and discloses a dynamic optimization algorithm for bus departure timetables based on deep reinforcement learning, in which a deep reinforcement learning method is introduced when optimizing the bus departure timetable. The deep reinforcement learning is based on a simulation model that accounts for the randomness of inter-stop travel times caused by factors such as road conditions, traffic lights, and weather, and adds passenger-flow characteristics generated from passenger demand and OD rules, remedying the lack of passenger-flow modeling and performance-evaluation functions in classical commercial traffic simulation software. The invention considers complex passenger flow when dynamically optimizing the bus departure timetable and establishes a bus-operation simulation model that combines passenger flow with traffic flow; the PPO algorithm can capture the complex environment state in real time and quickly generate a bus departure timetable, constituting a dynamic optimization strategy with high robustness in the face of a complex environment.

Description

Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning
Technical Field
The invention relates to the technical field of intelligent bus dispatching systems, in particular to a dynamic optimization algorithm of a bus departure schedule based on deep reinforcement learning.
Background
The mainstream approach to bus departure timetabling at home and abroad is operations-research modeling: an objective function combining one or more indicators, such as minimum passenger waiting time, shortest transfer time, or lowest bus operating cost, is established, and the departure times or headways of one or more bus routes are computed optimally. Scheduling optimization based on a traffic simulation model, by contrast, has the advantage of realistically reproducing the dynamic behavior of traffic flow and passenger flow. Classical traffic simulation software such as Aimsun, Vissim, and Paramics can capture traffic-flow characteristics, but lacks passenger-flow modeling and performance-evaluation functions.
Many factors (such as mixed traffic flow, passenger flow, and traffic lights) must be considered when solving the bus scheduling optimization problem. In most existing research on bus dispatching optimization, whether the static problem is solved with traditional operations-research modeling or the dynamic problem with machine learning methods, the solutions obtained are unsatisfactory because of the short decision horizon, insufficient exploration of the problem structure, and the influence of numerous uncertain environmental factors.
In summary, there is no simulation-based scheduling optimization research that links passenger flow to bus operation through simulation and thereby systematically represents the combination of traffic flow and passenger flow.
Disclosure of Invention
The invention aims to remedy the lack, in the prior art, of simulation-based scheduling optimization research that links passenger flow to bus operation through simulation and thereby systematically combines traffic flow and passenger flow, and provides a dynamic optimization algorithm for bus departure timetables based on deep reinforcement learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
the bus departure timetable dynamic optimization algorithm based on deep reinforcement learning introduces a deep reinforcement learning method when optimizing a bus departure timetable, and comprises the following steps: s1, deep reinforcement learning is based on a simulation model, randomness that travel time between buses is possibly influenced by road conditions, traffic lights, weather and other factors is considered, passenger flow characteristics generated according to passenger requirements and OD rules are added, the defects of the conventional commercial traffic simulation software due to lack of functions of modeling and performance evaluation of passenger flow are overcome, the traffic flow and the passenger flow are expressed, and an actual bus system can be reflected more truly; s2, the reinforcement learning is combined with the principle of dynamic planning and supervised learning, complex scenes can be processed, and the reinforcement learning method has the capability of real-time learning and lifelong learning, so that the reinforcement learning method is very suitable for solving the problem of bus scheduling, and the reinforcement learning method can learn the optimal decision function by using reinforcement signals only by giving a group of feasible actions (Action) and the current bus system State (State), and further make the optimal decision Action.
Further, the reinforcement learning system involves two subjects: the agent (Agent) and the environment (Environment). The bus system has many possible complex states; taking a vehicle as the object, six features are selected to form the state set $s_t = \{t_t, k_t, j_t, l_t, b_t, p_t\}$, where $t_t$ is the time at time step $t$; $k_t$, $j_t$, and $l_t$ respectively denote the vehicle's travel direction, the inter-stop segment it occupies, and its distance to the next stop; $b_t$ is the remaining capacity of the vehicle cabin; and $p_t$ is the total number of passengers waiting at all stops between the current vehicle and the vehicle ahead of it. The agent has two possible actions: deciding to 'depart' or 'not depart' from the depot. State prediction and action decisions are then made at fixed time intervals over a future horizon, realizing dynamic optimization of the bus departure timetable.
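To make the state and action definitions concrete, the following is a minimal sketch (not taken from the patent itself) of how the six-feature state set and the binary depot action could be encoded; all names (BusState, Action, to_vector) are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum

class Action(IntEnum):
    HOLD = 0     # "not depart": keep the next vehicle at the depot
    DEPART = 1   # "depart": dispatch the next vehicle

@dataclass
class BusState:
    """Six-feature state s_t = {t_t, k_t, j_t, l_t, b_t, p_t} for one vehicle."""
    t: float            # t_t: clock time at time step t
    direction: int      # k_t: travel direction (0 = up, 1 = down)
    segment: int        # j_t: index of the inter-stop segment occupied
    dist_to_next: float # l_t: remaining distance to the next stop (meters)
    residual_cap: int   # b_t: remaining capacity of the vehicle cabin
    waiting_ahead: int  # p_t: passengers waiting at all stops between this
                        #      vehicle and the vehicle in front of it

    def to_vector(self) -> list[float]:
        # Flatten into the feature vector fed to the network input layer.
        return [self.t, self.direction, self.segment,
                self.dist_to_next, self.residual_cap, self.waiting_ahead]
```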
Furthermore, the agent corresponds to a bus dispatcher, and the environment is built on a bus-operation simulation model. To run the simulation, besides preset data such as route and stop information, the number of vehicles at the depot, the departure timetable, and the departure type, a vehicle travel-time law and a passenger-flow OD law are required. Bus travel time is affected by factors such as road conditions, traffic lights, and passengers boarding and alighting at stops, while passenger OD is affected by factors such as weather, travel behavior, and time of day; both are rather complex random variables. To fit their distributions, kernel density estimation (KDE) is adopted to generate the random numbers required by the simulation.
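As a hedged illustration of the KDE step, the sketch below fits a kernel density estimate to observed link travel times and draws random samples for the simulation; scipy's gaussian_kde is one possible implementation choice, and the sample data are invented.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Observed travel times (minutes) on one link, e.g. from AVL logs.
# These values are illustrative only.
observed = np.array([4.1, 4.5, 5.0, 5.2, 4.8, 6.3, 5.5, 4.9, 7.1, 5.1])

# Fit a kernel density estimate to the empirical distribution.
kde = gaussian_kde(observed)

# Draw the random travel times the simulation consumes; clip to stay positive.
samples = np.clip(kde.resample(size=1000)[0], a_min=0.5, a_max=None)
print(samples[:5])
```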
Furthermore, the bus-operation simulation model consists of three parts: departure from the origin terminal, vehicle arrival at and departure from stops, and passenger boarding and alighting. When a vehicle arrives at a stop, it queues to enter, opens its doors, lets passengers alight and board, closes its doors, and then leaves; this service process can be expressed as:

$t^{d}_{k,i,j} = t^{a}_{k,i,j} + \beta + \max\left(\beta_b B_{k,i,j},\ \beta_o O_{k,i,j}\right)$

where $t^{a}_{k,i,j}$ and $t^{d}_{k,i,j}$ respectively denote the times at which trip $i$ in direction $k$ arrives at and departs from stop $j$; $\beta$ denotes the door opening and closing time; $\beta_b$ and $\beta_o$ respectively denote the average time spent per passenger boarding and alighting; and $B_{k,i,j}$ and $O_{k,i,j}$ respectively denote the numbers of passengers boarding and alighting while the vehicle dwells at stop $j$ in direction $k$.

The number of passengers waiting at a stop must include those left behind by the preceding trip because they failed to board:

$W_{k,i+1,j} = L_{k,i,j} + A_{k,i,j}$

where $W_{k,i+1,j}$ is the total number waiting at stop $j$ in direction $k$ for trip $i+1$; $L_{k,i,j}$ is the number of passengers stranded at stop $j$ because they could not board when trip $i$ in direction $k$ arrived; and $A_{k,i,j}$ is the number of passengers arriving at stop $j$ between trips $i$ and $i+1$ in direction $k$.

The number of passengers boarding at a stop is limited by the vehicle's carrying capacity:

$B_{k,i,j} = \min\left(W_{k,i,j},\ m - C_{k,i,j-1} + O_{k,i,j}\right)$

where $m$ denotes the rated passenger capacity of the vehicle and $C_{k,i,j-1}$ denotes the onboard load when trip $i$ leaves stop $j-1$ in direction $k$.

The number of passengers on board is computed as:

$C_{k,i,j} = C_{k,i,j-1} + B_{k,i,j} - O_{k,i,j}$

The number of passengers stranded at the stop is computed as:

$L_{k,i,j} = W_{k,i,j} - B_{k,i,j}$

where $W_{k,1,j}$, the waiting count seen by the first trip, is the number of passengers that have arrived at stop $j$ in direction $k$ before the first trip reaches the stop.
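The following sketch (an assumption-laden restatement, not the patent's code) implements the stop-service equations above for a single trip at one stop: dwell time, capacity-limited boarding, onboard-load update, and stranded passengers; the default timing parameters are invented.

```python
def serve_stop(t_arr, waiting, onboard, alighting, capacity,
               beta=5.0, beta_b=2.0, beta_o=1.5):
    """One stop-service event for trip i at stop j in direction k.

    t_arr     : arrival time t^a (seconds)
    waiting   : W, passengers waiting when the vehicle arrives
    onboard   : C_{j-1}, load when the vehicle left the previous stop
    alighting : O, passengers getting off at this stop
    capacity  : m, rated passenger capacity
    beta, beta_b, beta_o: door time and per-passenger boarding/alighting
                          times (illustrative defaults)
    """
    # B = min(W, m - C_{j-1} + O): boarding limited by residual capacity.
    boarding = max(0, min(waiting, capacity - onboard + alighting))
    # C_j = C_{j-1} + B - O: updated onboard load.
    onboard_after = onboard + boarding - alighting
    # L = W - B: passengers stranded because they could not board.
    stranded = waiting - boarding
    # t^d = t^a + beta + max(beta_b * B, beta_o * O): departure time.
    t_dep = t_arr + beta + max(beta_b * boarding, beta_o * alighting)
    return t_dep, onboard_after, stranded
```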
Furthermore, the execution of each departure action can affect the effects of other actions. Moreover, the effect of each executed action on the system is usually delayed, because the vehicle must reach the corresponding stop before it can serve passengers. During operation, because of passenger flow, road conditions, and similar factors, a bus that has finished its earlier trips often returns to the origin or terminal stop with a deviation from the pre-planned timetable, so that no vehicle may be available to execute the next scheduled departure. If the agent still issues a departure action when no vehicle is at the depot, a corresponding penalty should be incurred to better guide the agent toward the target state.
Further, the reward function (Reward function) consists of two parts: the average passenger waiting time and the penalty for erroneous departure instructions:

$R(\tau) = \sum_{t=0}^{T} \gamma^{t}\left(-\mathrm{AWT}_t - p_t\right)$

where the time step $t$ ranges over $[0, T]$; $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$ is the trajectory of states and actions over one episode; $\gamma$ is the discount factor with range $(0, 1]$; $p_t$ is the penalty incurred by the agent at time step $t$ for an erroneous departure instruction; and AWT is the average passenger waiting time. To maximize this objective, an artificial neural network is designed with a structure comprising an input layer, an actor (Actor) network, and a critic (Critic) network; the complex mathematical structure of an artificial neural network gives it an advantage in handling a complex bus system with stochastic traffic flow and passenger flow. The state sequence $(s_t, s_{t+1}, \ldots, s_T)$ serves as the input layer and is fed into the hidden layers of the actor and critic networks; the actor network makes decision actions according to the current state of the bus system, while the critic network estimates the state-value function $V(s_t)$, from which the advantage function $A(s_t, a_t)$ is computed and used in updating the actor network's parameters.
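As a hedged sketch of the actor-critic architecture just described (the patent only specifies an input layer feeding actor and critic networks; layer sizes, activations, and PyTorch as the framework are assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared input layer feeding separate actor and critic heads."""

    def __init__(self, state_dim: int = 6, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        # Actor: probability over {hold, depart} given the bus-system state.
        self.actor = nn.Sequential(nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))
        # Critic: scalar estimate of the state-value function V(s_t).
        self.critic = nn.Linear(hidden, 1)

    def forward(self, state: torch.Tensor):
        h = self.shared(state)
        return self.actor(h), self.critic(h)
```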
In summary, the invention includes at least one of the following beneficial technical effects:
1. training data are collected through bus-operation simulation, and an artificial neural network is trained with the Proximal Policy Optimization (PPO) algorithm; the algorithm can capture the complex environment state in real time and dynamically optimize the bus departure timetable;
2. after training, the PPO-trained network solves the departure timetable more efficiently than traditional methods; more importantly, the PPO algorithm can capture the complex environment state in real time and quickly generate a bus departure timetable, making it a dynamic optimization strategy (a sketch of the clipped PPO update appears below). In addition, the performance of heuristic algorithms is sensitive to fine-tuning of the underlying parameter settings and adapts poorly to new environments, whereas the invention exhibits higher robustness in the face of a complex environment.
Drawings
FIG. 1 is a learning process diagram illustrating reinforcement learning according to the present invention;
FIG. 2 is a flow chart of bus simulation operation according to the present invention;
FIG. 3 is a diagram illustrating an Actor-Critic network structure according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work are within the scope of the present invention.
In the description of the present invention, it should be noted that terms such as "upper", "lower", "inner", "outer", and "top/bottom" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus are not to be construed as limiting the invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "mounted", "disposed", "sleeved", "connected", and the like are to be construed broadly: for example, "connected" may mean fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected, or indirectly connected through an intermediate medium, or communicating between the interiors of two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art in light of specific circumstances.
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to FIGS. 1-3, the bus departure timetable dynamic optimization algorithm based on deep reinforcement learning introduces a deep reinforcement learning method when optimizing the bus departure timetable, and comprises the following steps: S1, the deep reinforcement learning is based on a simulation model that accounts for the randomness of inter-stop travel times caused by factors such as road conditions, traffic lights, and weather, and adds passenger-flow characteristics generated from passenger demand and OD rules; this remedies the lack of passenger-flow modeling and performance-evaluation functions in existing commercial traffic simulation software, represents both traffic flow and passenger flow, and reflects an actual bus system more faithfully;
S2, reinforcement learning combines the principles of dynamic programming and supervised learning, can handle complex scenarios, and is capable of real-time and lifelong learning, which makes it well suited to the bus scheduling problem: given only a set of feasible actions (Action) and the current bus system state (State), it learns the optimal decision function from reinforcement signals and thereby takes the optimal decision action.
The reinforcement learning system involves two subjects: the agent (Agent) and the environment (Environment). The bus system has many possible complex states; taking a vehicle as the object, six features are selected to form the state set $s_t = \{t_t, k_t, j_t, l_t, b_t, p_t\}$, where $t_t$ is the time at time step $t$; $k_t$, $j_t$, and $l_t$ respectively denote the vehicle's travel direction, the inter-stop segment it occupies, and its distance to the next stop; $b_t$ is the remaining capacity of the vehicle cabin; and $p_t$ is the total number of passengers waiting at all stops between the current vehicle and the vehicle ahead of it. The agent has two possible actions: deciding to 'depart' or 'not depart' from the depot. State prediction and action decisions are then made at fixed time intervals over a future horizon, realizing dynamic optimization of the bus departure timetable. The agent corresponds to a bus dispatcher, and the environment is built on a bus-operation simulation model; to run the simulation, besides preset data such as route and stop information, the number of vehicles at the depot, the departure timetable, and the departure type, a vehicle travel-time law and a passenger-flow OD law are required. Bus travel time is affected by factors such as road conditions, traffic lights, and passengers boarding and alighting at stops, while passenger OD is affected by factors such as weather, travel behavior, and time of day; both are rather complex random variables, and kernel density estimation (KDE) is adopted to fit their distributions and generate the random numbers required by the simulation.
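To illustrate the agent-environment interaction described above, the hedged sketch below shows the simulation's decision loop at fixed time intervals, including the penalty applied when the agent issues a departure instruction while no vehicle is available at the depot; the entire env/agent interface (state, depot_has_vehicle, dispatch, advance, average_waiting_time, act) is an assumption.

```python
def run_episode(env, agent, dt=60.0, horizon=18 * 3600, depart_penalty=10.0):
    """Advance the bus-operation simulation in dt-second decision steps."""
    total_reward, t = 0.0, 0.0
    while t < horizon:
        state = env.state()                 # six-feature state set s_t
        action = agent.act(state)           # 0 = hold, 1 = depart
        penalty = 0.0
        if action == 1:
            if env.depot_has_vehicle():
                env.dispatch()              # add the next trip to the timetable
            else:
                penalty = depart_penalty    # erroneous instruction: no vehicle
        env.advance(dt)                     # simulate buses and passenger flow
        total_reward += -(env.average_waiting_time() + penalty)
        t += dt
    return total_reward
```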
The bus-operation simulation model consists of three parts: departure from the origin terminal, vehicle arrival at and departure from stops, and passenger boarding and alighting. When a vehicle arrives at a stop, it queues to enter, opens its doors, lets passengers alight and board, closes its doors, and then leaves; this service process can be expressed as:

$t^{d}_{k,i,j} = t^{a}_{k,i,j} + \beta + \max\left(\beta_b B_{k,i,j},\ \beta_o O_{k,i,j}\right)$

where $t^{a}_{k,i,j}$ and $t^{d}_{k,i,j}$ respectively denote the times at which trip $i$ in direction $k$ arrives at and departs from stop $j$; $\beta$ denotes the door opening and closing time; $\beta_b$ and $\beta_o$ respectively denote the average time spent per passenger boarding and alighting; and $B_{k,i,j}$ and $O_{k,i,j}$ respectively denote the numbers of passengers boarding and alighting while the vehicle dwells at stop $j$ in direction $k$.

The number of passengers waiting at a stop must include those left behind by the preceding trip because they failed to board:

$W_{k,i+1,j} = L_{k,i,j} + A_{k,i,j}$

where $W_{k,i+1,j}$ is the total number waiting at stop $j$ in direction $k$ for trip $i+1$; $L_{k,i,j}$ is the number of passengers stranded at stop $j$ because they could not board when trip $i$ in direction $k$ arrived; and $A_{k,i,j}$ is the number of passengers arriving at stop $j$ between trips $i$ and $i+1$ in direction $k$.

The number of passengers boarding at a stop is limited by the vehicle's carrying capacity:

$B_{k,i,j} = \min\left(W_{k,i,j},\ m - C_{k,i,j-1} + O_{k,i,j}\right)$

where $m$ denotes the rated passenger capacity of the vehicle and $C_{k,i,j-1}$ denotes the onboard load when trip $i$ leaves stop $j-1$ in direction $k$.

The number of passengers on board is computed as:

$C_{k,i,j} = C_{k,i,j-1} + B_{k,i,j} - O_{k,i,j}$

The number of passengers stranded at the stop is computed as:

$L_{k,i,j} = W_{k,i,j} - B_{k,i,j}$

where $W_{k,1,j}$, the waiting count seen by the first trip, is the number of passengers that have arrived at stop $j$ in direction $k$ before the first trip reaches the stop.
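Building on the single-stop sketch given earlier, this self-contained, hedged example chains the same stop-service equations along all stops of one trip, propagating onboard load and stranded passengers; the inter-stop travel times would come from the fitted KDE, and all parameter values are illustrative assumptions.

```python
def run_trip(t_depart, waiting, alighting, link_times,
             capacity=80, beta=5.0, beta_b=2.0, beta_o=1.5):
    """Simulate one trip along the route with the stop-service equations.

    waiting[j]   : W, total passengers waiting when the trip reaches stop j
                   (stranded from the previous trip plus new arrivals)
    alighting[j] : O, passengers getting off at stop j
    link_times[j]: travel time to stop j, e.g. drawn from the fitted KDE
    """
    t, onboard, stranded = t_depart, 0, []
    for j, travel in enumerate(link_times):
        t += travel                                                   # t^a at stop j
        boarding = max(0, min(waiting[j],
                              capacity - onboard + alighting[j]))     # B, capacity-limited
        onboard += boarding - alighting[j]                            # C update
        stranded.append(waiting[j] - boarding)                        # L, carried to next trip
        t += beta + max(beta_b * boarding, beta_o * alighting[j])     # t^d at stop j
    return t, stranded
```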
The execution of each departure action can affect the effects of other actions; moreover, the effect of each executed action on the system is usually delayed, because the vehicle must reach the corresponding stop before it can serve passengers. During operation, because of passenger flow, road conditions, and similar factors, a bus that has finished its earlier trips often returns to the origin or terminal stop with a deviation from the pre-planned timetable, so that no vehicle may be available to execute the next scheduled departure; if the agent still issues a departure action when no vehicle is at the depot, a corresponding penalty should be incurred to better guide the agent toward the target state.
The reward function (Reward function) consists of two parts: the average passenger waiting time and the penalty for erroneous departure instructions:

$R(\tau) = \sum_{t=0}^{T} \gamma^{t}\left(-\mathrm{AWT}_t - p_t\right)$

where the time step $t$ ranges over $[0, T]$; $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$ is the trajectory of states and actions over one episode; $\gamma$ is the discount factor with range $(0, 1]$; $p_t$ is the penalty incurred by the agent at time step $t$ for an erroneous departure instruction; and AWT is the average passenger waiting time. To maximize this objective, an artificial neural network is designed with a structure comprising an input layer, an actor (Actor) network, and a critic (Critic) network; the complex mathematical structure of an artificial neural network gives it an advantage in handling a complex bus system with stochastic traffic flow and passenger flow. The state sequence $(s_t, s_{t+1}, \ldots, s_T)$ serves as the input layer and is fed into the hidden layers of the actor and critic networks; the actor network makes decision actions according to the current state of the bus system, while the critic network estimates the state-value function $V(s_t)$, from which the advantage function $A(s_t, a_t)$ is computed and used in updating the actor network's parameters.
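For completeness, a hedged sketch of computing the discounted returns and the advantages $A(s_t, a_t) = R_t - V(s_t)$ that feed the actor update; a simple Monte-Carlo return estimate is assumed here, since the patent does not specify the estimator.

```python
def returns_and_advantages(rewards, values, gamma=0.99):
    """Monte-Carlo discounted returns R_t and advantages A_t = R_t - V(s_t).

    rewards: per-step rewards -(AWT_t + p_t); values: critic estimates V(s_t).
    """
    returns, acc = [], 0.0
    for r in reversed(rewards):        # accumulate from the episode's end
        acc = r + gamma * acc
        returns.append(acc)
    returns.reverse()
    return returns, [g - v for g, v in zip(returns, values)]
```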
In summary, the invention optimizes the bus departure timetable with a bus-operation simulation model that combines traffic flow and passenger flow, and therefore better matches actual bus operating conditions. Second, the artificial neural network is trained with the PPO algorithm, and after training it solves the departure timetable more efficiently than traditional methods; more importantly, the PPO algorithm can capture the complex environment state in real time and quickly generate a bus departure timetable, a dynamic optimization strategy that traditional methods cannot match. In addition, the performance of heuristic algorithms is sensitive to fine-tuning of the underlying parameter settings and adapts poorly to new environments, whereas the invention exhibits higher robustness in the face of a complex environment.
The above are all preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby; all equivalent changes made according to the structure, shape, and principle of the invention are covered by the protection scope of the invention.

Claims (6)

1. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning, characterized in that a deep reinforcement learning method is introduced when optimizing the bus departure timetable, comprising the following steps: S1, the deep reinforcement learning is based on a simulation model that accounts for the randomness of inter-stop travel times caused by factors such as road conditions, traffic lights, and weather, and adds passenger-flow characteristics generated from passenger demand and OD rules; this remedies the lack of passenger-flow modeling and performance-evaluation functions in existing commercial traffic simulation software, represents both traffic flow and passenger flow, and reflects an actual bus system more faithfully; S2, reinforcement learning combines the principles of dynamic programming and supervised learning, can handle complex scenarios, and is capable of real-time and lifelong learning, which makes it well suited to the bus scheduling problem: given only a set of feasible actions (Action) and the current bus system state (State), it learns the optimal decision function from reinforcement signals and thereby takes the optimal decision action.
2. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning of claim 1, characterized in that: the reinforcement learning system involves two subjects, the agent (Agent) and the environment (Environment); the bus system has many possible complex states, and, taking a vehicle as the object, six features are selected to form the state set $s_t = \{t_t, k_t, j_t, l_t, b_t, p_t\}$, where $t_t$ is the time at time step $t$; $k_t$, $j_t$, and $l_t$ respectively denote the vehicle's travel direction, the inter-stop segment it occupies, and its distance to the next stop; $b_t$ is the remaining capacity of the vehicle cabin; and $p_t$ is the total number of passengers waiting at all stops between the current vehicle and the vehicle ahead of it; the agent has two possible actions: deciding to 'depart' or 'not depart' from the depot; state prediction and action decisions are then made at fixed time intervals over a future horizon, realizing dynamic optimization of the bus departure timetable.
3. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning of claim 2, characterized in that: the agent corresponds to a bus dispatcher, and the environment is built on a bus-operation simulation model; to run the simulation, besides preset data such as route and stop information, the number of vehicles at the depot, the departure timetable, and the departure type, a vehicle travel-time law and a passenger-flow OD law are required; bus travel time is affected by factors such as road conditions, traffic lights, and passengers boarding and alighting at stops, while passenger OD is affected by factors such as weather, travel behavior, and time of day; both are rather complex random variables, and kernel density estimation (KDE) is adopted to fit their distributions and generate the random numbers required by the simulation.
4. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning of claim 3, characterized in that: the bus-operation simulation model consists of three parts: departure from the origin terminal, vehicle arrival at and departure from stops, and passenger boarding and alighting; when a vehicle arrives at a stop, it queues to enter, opens its doors, lets passengers alight and board, closes its doors, and then leaves, and this service process can be expressed as:

$t^{d}_{k,i,j} = t^{a}_{k,i,j} + \beta + \max\left(\beta_b B_{k,i,j},\ \beta_o O_{k,i,j}\right)$

where $t^{a}_{k,i,j}$ and $t^{d}_{k,i,j}$ respectively denote the times at which trip $i$ in direction $k$ arrives at and departs from stop $j$; $\beta$ denotes the door opening and closing time; $\beta_b$ and $\beta_o$ respectively denote the average time spent per passenger boarding and alighting; and $B_{k,i,j}$ and $O_{k,i,j}$ respectively denote the numbers of passengers boarding and alighting while the vehicle dwells at stop $j$ in direction $k$;

the number of passengers waiting at a stop must include those left behind by the preceding trip because they failed to board:

$W_{k,i+1,j} = L_{k,i,j} + A_{k,i,j}$

where $W_{k,i+1,j}$ is the total number waiting at stop $j$ in direction $k$ for trip $i+1$; $L_{k,i,j}$ is the number of passengers stranded at stop $j$ because they could not board when trip $i$ in direction $k$ arrived; and $A_{k,i,j}$ is the number of passengers arriving at stop $j$ between trips $i$ and $i+1$ in direction $k$;

the number of passengers boarding at a stop is limited by the vehicle's carrying capacity:

$B_{k,i,j} = \min\left(W_{k,i,j},\ m - C_{k,i,j-1} + O_{k,i,j}\right)$

where $m$ denotes the rated passenger capacity of the vehicle and $C_{k,i,j-1}$ denotes the onboard load when trip $i$ leaves stop $j-1$ in direction $k$;

the number of passengers on board is computed as:

$C_{k,i,j} = C_{k,i,j-1} + B_{k,i,j} - O_{k,i,j}$

and the number of passengers stranded at the stop is computed as:

$L_{k,i,j} = W_{k,i,j} - B_{k,i,j}$

where $W_{k,1,j}$, the waiting count seen by the first trip, is the number of passengers that have arrived at stop $j$ in direction $k$ before the first trip reaches the stop.
5. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning of claim 4, characterized in that: the execution of each departure action can affect the effects of other actions; moreover, the effect of each executed action on the system is usually delayed, because the vehicle must reach the corresponding stop before it can serve passengers; during operation, because of passenger flow, road conditions, and similar factors, a bus that has finished its earlier trips often returns to the origin or terminal stop with a deviation from the pre-planned timetable, so that no vehicle may be available to execute the next scheduled departure; if the agent still issues a departure action when no vehicle is at the depot, a corresponding penalty should be incurred to better guide the agent toward the target state.
6. The bus departure timetable dynamic optimization algorithm based on deep reinforcement learning of claim 5, characterized in that: the reward function (Reward function) consists of two parts: the average passenger waiting time and the penalty for erroneous departure instructions:

$R(\tau) = \sum_{t=0}^{T} \gamma^{t}\left(-\mathrm{AWT}_t - p_t\right)$

where the time step $t$ ranges over $[0, T]$; $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$ is the trajectory of states and actions over one episode; $\gamma$ is the discount factor with range $(0, 1]$; $p_t$ is the penalty incurred by the agent at time step $t$ for an erroneous departure instruction; and AWT is the average passenger waiting time; to maximize this objective, an artificial neural network is designed with a structure comprising an input layer, an actor (Actor) network, and a critic (Critic) network; the complex mathematical structure of an artificial neural network gives it an advantage in handling a complex bus system with stochastic traffic flow and passenger flow; the state sequence $(s_t, s_{t+1}, \ldots, s_T)$ serves as the input layer and is fed into the hidden layers of the actor and critic networks; the actor network makes decision actions according to the current state of the bus system, while the critic network estimates the state-value function $V(s_t)$, from which the advantage function $A(s_t, a_t)$ is computed and used in updating the actor network's parameters.
CN202210028133.4A 2022-01-11 2022-01-11 Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning Withdrawn CN114240002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210028133.4A CN114240002A (en) 2022-01-11 2022-01-11 Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210028133.4A CN114240002A (en) 2022-01-11 2022-01-11 Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN114240002A true CN114240002A (en) 2022-03-25

Family

ID=80746181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210028133.4A Withdrawn CN114240002A (en) 2022-01-11 2022-01-11 Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114240002A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034522A (en) * 2022-08-10 2022-09-09 深圳市四格互联信息技术有限公司 Dynamic dispatching method for commuting regular bus based on employee off-duty time and off-duty station
CN115034522B (en) * 2022-08-10 2022-11-25 深圳市四格互联信息技术有限公司 Dynamic dispatching method for commuting regular bus based on employee off-duty time and off-duty station
CN115691196A (en) * 2022-10-19 2023-02-03 扬州大学 Multi-strategy fusion control method for bus operation in intelligent networking environment
CN115691196B (en) * 2022-10-19 2023-10-03 扬州大学 Public transport operation multi-strategy fusion control method in intelligent networking environment

Similar Documents

Publication Publication Date Title
JP3414843B2 (en) Transportation control device
EP4030365A1 (en) Multi-mode multi-service rail transit analog simulation method and system
CN112883640B (en) Digital twin station system, job scheduling method based on system and application
CN107358357A (en) Urban track traffic transfer station evaluation method
CN114240002A (en) Bus departure timetable dynamic optimization algorithm based on deep reinforcement learning
CN110222972B (en) Urban rail transit road network cooperative current limiting method based on data driving
CN114662778B (en) Urban rail transit line network train operation interval cooperative decision method
CN115527369B (en) Large passenger flow early warning and evacuation method under large-area delay condition of airport hub
CN113222387A (en) Multi-objective scheduling and collaborative optimization method for hydrogen fuel vehicle
CN113536692B (en) Intelligent dispatching method and system for high-speed rail train under uncertain environment
Wang et al. A data-driven hybrid control framework to improve transit performance
Xiong et al. Parallel bus rapid transit (BRT) operation management system based on ACP approach
JP6902481B2 (en) Resource arbitration system and resource arbitration device
CN115563761A (en) Subway junction station surrounding road congestion prediction method based on timetable
Li et al. Real-time scheduling on a transit bus route
CN115481777A (en) Multi-line bus dynamic schedule oriented collaborative simulation optimization method, device and medium
CN114004440A Synthetic hub passenger transport organization evaluation method based on AnyLogic
Zhang et al. Study on evaluation indicators system of crowd management for transfer stations based on pedestrian simulation
Liu Optimization of Computer-aided Decision-making System for Railroad Traffic Dispatching Command
Zhong et al. Deep Q-Learning Network Model for Optimizing Transit Bus Priority at Multiphase Traffic Signal Controlled Intersection
Lioris et al. Overview of a dynamic evaluation of collective taxi systems providing an optimal performance
Wu et al. Reinforcement Learning Based Demand-Responsive Public Transit Dispatching
Xiao et al. A novel bus scheduling model based on passenger flow and bus travel time prediction using the improved cuckoo search algorithm
CN115034622B (en) Input and output risk quantitative evaluation method for public transport system
Yu et al. A new approach on passenger flow assignment with multi-connected agents

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220325

WW01 Invention patent application withdrawn after publication