CN113326993A - Shared bicycle scheduling method based on deep reinforcement learning - Google Patents

Shared bicycle scheduling method based on deep reinforcement learning Download PDF

Info

Publication number
CN113326993A
CN113326993A (application CN202110744265.2A)
Authority
CN
China
Prior art keywords
variable
dispatching
scheduling
vehicle
shared bicycle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110744265.2A
Other languages
Chinese (zh)
Other versions
CN113326993B (en)
Inventor
肖峰 (Xiao Feng)
涂雯雯 (Tu Wenwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwestern University Of Finance And Economics
Original Assignee
Southwestern University Of Finance And Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwestern University Of Finance And Economics filed Critical Southwestern University Of Finance And Economics
Publication of CN113326993A publication Critical patent/CN113326993A/en
Application granted granted Critical
Publication of CN113326993B publication Critical patent/CN113326993B/en
Current legal status: Active

Classifications

    • G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F30/15 — Computer-aided design [CAD]; Geometric CAD; Vehicle, aircraft or watercraft design
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM]
    • G06N3/04 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
    • G06N3/084 — Neural network learning methods; Backpropagation, e.g. using gradient descent
    • G06Q10/06315 — Needs-based resource requirements planning or analysis
    • G06Q50/40 — Business processes related to the transportation industry
    • G06F2111/04 — Constraint-based CAD
    • G06F2111/08 — Probabilistic or stochastic CAD
    • G06F2119/12 — Timing analysis or timing optimisation
    • Y02T10/40 — Engine management systems (climate change mitigation technologies related to road transport)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)

Abstract

The invention discloses a shared bicycle scheduling method based on deep reinforcement learning, which comprises the following steps. S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles. S2: determining the scheduling variables of the shared bicycles. S3: constructing a vehicle dispatching optimization model for the shared bicycles. S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the framework. The reinforcement-learning-based shared bicycle scheduling optimization method provided by the invention helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on large-scale road networks under random and complex dynamic environments. The method takes into account future changes in supply and demand and the interaction between scheduling decisions and the environment; it requires neither advance demand forecasting nor manual data processing, and is not limited by the computational efficiency or accuracy of demand prediction.

Description

Shared bicycle scheduling method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In previous research, the shared bicycle scheduling optimization problem is usually solved by dividing the scheduling horizon into separate time periods and searching for the optimal scheduling strategy independently within each period. However, the scheduling strategy of one time period affects the supply and demand environment of the next and later periods. An isolated, period-by-period optimization method does not take into account the supply and demand conditions of future periods or the consequences of the strategy that has been implemented. As a result, the strategy that is optimal for the current period does not necessarily increase the actual trip volume in future periods, and may even reduce it. Therefore, a period-by-period isolated optimization method does not necessarily yield an optimal global strategy over the full scheduling horizon.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem over long scheduling horizons, in dynamic environments and on large-scale networks, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows: a shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle dispatching optimization model for the shared bicycles according to the scheduling variables;
S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
Further, in step S1, the specific method for dividing the scheduling area of the shared bicycles is as follows: the scheduling area of the shared bicycles is divided into a plurality of regular hexagons serving as scheduling area units, and for each scheduling area unit a global label variable η5, a horizontal-direction label variable m and a vertical-direction label variable h are defined, which satisfy the following relational expressions:
η5(m,h) = (M+1)m + h
m ∈ {0,1,...,M}
h ∈ {0,1,...,M}
wherein η5 ∈ M′, M′ = {0,1,...,(M+1)²−1}, M denotes the maximum value of the horizontal-direction or vertical-direction label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units.
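For illustration only, a minimal sketch of the mapping between the (m, h) coordinates and the global unit label η5 defined above might look as follows; the function names are illustrative and not part of the invention.

# Illustrative sketch of the hexagonal-unit labelling described above.
# Assumes M is the maximum horizontal/vertical label (labels run 0..M).

def unit_label(m: int, h: int, M: int) -> int:
    """Global label eta5 of the scheduling area unit at (m, h)."""
    assert 0 <= m <= M and 0 <= h <= M
    return (M + 1) * m + h

def unit_coords(eta5: int, M: int) -> tuple[int, int]:
    """Inverse mapping: recover (m, h) from the global label eta5."""
    assert 0 <= eta5 <= (M + 1) ** 2 - 1
    return divmod(eta5, M + 1)

# Example: with M = 9 (a 10 x 10 grid of hexagonal units),
# unit (m=3, h=7) has global label 37 and unit_coords(37, 9) == (3, 7).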
In step S1, the operating environment variables of the shared bicycles include time variables and an urban fixed-warehouse location set variable;
the time variables comprise a time step variable t, a time step set T and a maximum time step variable Tmax, where t ∈ T and T = {0,1,...,Tmax};
the urban fixed-warehouse location set variable comprises the fixed warehouse location set ηw.
Further, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling strategy variable class;
the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of the scheduling area unit when tr = 1;
at time step t, the riding trip variable class comprises the global label η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global label η3 of the scheduling area unit where the OD destination of the trip is located, the OD label variable (η2, η3) of a shared bicycle trip, the OD flow of shared bicycle trips, the trip flow rate of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip generation variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
at time step t, the scheduling strategy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the label variable ηi,0 of the unit from which a dispatching vehicle departs, the label variable ηi,1 of the unit at which a dispatching vehicle arrives, the set κ1 of moving-direction variables of a dispatching vehicle, the set κ2 of scheduling ratio variables, the moving-direction variable of a dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling strategy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable giving the number of shared bicycles that a dispatching vehicle picks up at ηi,0 and places at ηi,1, the ratio αwh of the number of shared bicycles placed at ηi,1 to the number of bicycles in the cabin when the dispatching vehicle arrives at ηi,1 and ηi,1 belongs to ηw, the predicted cumulative increase or decrease of the supply of unit η5 before a dispatching vehicle implements its scheduling strategy, under the expectation that the preceding dispatching vehicles implement theirs, the increased revenue obtained after a dispatching vehicle implements its scheduling strategy, and the total number Zwarehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling cycle;
where I = {0,1,...,N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0,1,...,5} and κ2 = {0, 0.25, 0.5, 0.75}; the moving-direction variable takes values in κ1 and the scheduling ratio variable takes values in κ2.
Further, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle dispatching optimization model of the shared bicycles;
S42: determining the mean action using one-hot encoding;
S43: defining the experience pool variables and the training-round related variables of the shared bicycle scheduling framework;
S44: constructing the shared bicycle scheduling framework from its elements, the mean action, the experience pool variables and the training-round related variables, based on mean field theory.
Further, in step S41, the elements of the shared bicycle scheduling framework include the state, the action parameter at and the reward functions, where the state of dispatching vehicle i, i = 0,...,N, at time step variable t describes the situation of the dispatching vehicle at that time step, and the action of dispatching vehicle i at time step variable t is the scheduling strategy of the dispatching vehicle at that time step;
the reward function comprises a reward function for actually increasing the traffic of the dispatching vehicle
Figure BDA0003142305910000035
Average increase traffic rewarding function of dispatching vehicle
Figure BDA0003142305910000036
And dispatching vehicle overall increase go function
Figure BDA0003142305910000037
The concrete formula is as follows:
Figure BDA0003142305910000038
Figure BDA0003142305910000039
Figure BDA00031423059100000310
wherein ,αrwThe scaling factor of the reward function is represented,
Figure BDA00031423059100000311
when indicating implementation of scheduling policy
Figure BDA00031423059100000312
The actual amount of the business trip of (c),
Figure BDA00031423059100000313
indicating when no scheduling policy is implemented
Figure BDA00031423059100000314
The actual amount of the business trip of (c),
Figure BDA00031423059100000315
when indicating implementation of scheduling policy
Figure BDA00031423059100000316
The actual amount of the business trip of (c),
Figure BDA00031423059100000317
when indicating not to implement scheduling policy
Figure BDA00031423059100000318
The actual amount of the business trip of (c),
Figure BDA00031423059100000319
representing the time step variable tth
Figure BDA00031423059100000320
The number of vehicles to be scheduled in the interior,
Figure BDA00031423059100000321
representing the time step variable tth
Figure BDA00031423059100000322
The number of vehicles to be scheduled in the interior,
Figure BDA00031423059100000323
indicating when scheduling policy is implemented η5The actual amount of the business trip of (c),
Figure BDA00031423059100000324
indicating when scheduling policy is not implemented η5N represents the maximum value of the tag variable of the dispatching car, η5A global label representing each scheduling region unit, and M' represents a unit label set of the scheduling region unit.
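The exact reward expressions are given only as images in the original filing; the following sketch merely illustrates the general structure described above (a scaled difference between trip volumes with and without the scheduling strategy), and the precise combination of terms as well as all names are assumptions.

# Hedged sketch of the reward structure described above: rewards are scaled
# differences between actual trip volumes with and without the scheduling
# strategy. The exact formulas appear as images in the original filing, so the
# combination of terms below is an assumption for illustration only.

def vehicle_reward(alpha_rw, trips_with, trips_without, origin, dest):
    """Actual increased-trip reward of one dispatching vehicle."""
    gain_dest = trips_with[dest] - trips_without[dest]
    gain_origin = trips_with[origin] - trips_without[origin]
    return alpha_rw * (gain_dest + gain_origin)

def average_vehicle_reward(alpha_rw, trips_with, trips_without,
                           origin, dest, n_vehicles_origin, n_vehicles_dest):
    """Average increased-trip reward: the same gains shared among the
    dispatching vehicles present in the origin and destination units."""
    gain_dest = (trips_with[dest] - trips_without[dest]) / max(n_vehicles_dest, 1)
    gain_origin = (trips_with[origin] - trips_without[origin]) / max(n_vehicles_origin, 1)
    return alpha_rw * (gain_dest + gain_origin)

def overall_reward(alpha_rw, trips_with, trips_without, unit_labels):
    """Overall increased-trip reward summed over every scheduling area unit."""
    return alpha_rw * sum(trips_with[u] - trips_without[u] for u in unit_labels)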
Further, in step S42, the specific method for determining the mean action is: the scheduling strategy of each dispatching vehicle is rewritten in one-hot encoded form, and the mean action of dispatching vehicle i is obtained by averaging the one-hot encoded action strategies of the other dispatching vehicles. In the corresponding calculation formulas, the one-hot components take the value 0 or 1, pdim denotes the dimension of the scheduling strategy, N denotes the maximum value of the dispatching vehicle label variable, and ine denotes the label variable of a dispatching vehicle other than dispatching vehicle i, whose action strategy enters the average.
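A minimal sketch of the one-hot mean-action computation described above might read as follows; the discrete action space of size pdim and the helper names are assumptions for illustration.

import numpy as np

# Illustrative sketch: one-hot encode each other vehicle's discrete scheduling
# action and average them to obtain the mean action seen by vehicle i.
# p_dim is the dimension of the (discretised) scheduling strategy.

def one_hot(action_index: int, p_dim: int) -> np.ndarray:
    v = np.zeros(p_dim)
    v[action_index] = 1.0
    return v

def mean_action(actions: dict[int, int], i: int, p_dim: int) -> np.ndarray:
    """Average of the one-hot actions of all dispatching vehicles other than i."""
    others = [one_hot(a, p_dim) for j, a in actions.items() if j != i]
    if not others:
        return np.zeros(p_dim)
    return np.mean(others, axis=0)

# Example: three vehicles, 24 possible (direction, ratio) combinations:
# actions = {0: 5, 1: 17, 2: 5}; mean_action(actions, 0, 24)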
Further, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round related variables comprise the number of training rounds Episode, the network-update interval Episodeupnet expressed in training rounds, the target network update weight coefficient ω and the cumulative reward discount factor γ.
Further, step S44 comprises the following sub-steps (a skeleton of the resulting loop is sketched after this list):
S441: initializing the experience pool; setting the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient αrw, the cumulative reward discount factor γ, the initial given supply, the shared bicycle travel demand variables and the trip flow rates of shared bicycles departing from η2 and arriving at η3; and cyclically performing steps S442-S445 over the training rounds;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling strategy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the state and the mean action of each dispatching vehicle for the next time step, and updating, according to the reward function, the increased revenue obtained after the dispatching vehicle implements its scheduling strategy;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework with a reinforcement learning algorithm, and completing shared bicycle scheduling with the scheduling framework.
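A highly simplified skeleton of the loop in steps S441-S446, assuming a generic actor-critic style update, might look as follows; all class, function and parameter names are illustrative placeholders rather than the invention's actual implementation.

import random
from collections import deque

# Skeleton of the training loop described in steps S441-S446 (illustrative;
# the environment and agent internals are placeholders).

def train(env, agents, episodes, episode_upnet, pool_capacity, gamma, omega):
    pool = deque(maxlen=pool_capacity)           # experience pool (S441)
    for episode in range(episodes):
        states = env.reset()                     # initial supply, demand, flow rates
        for t in range(env.t_max):
            env.step_riding(t)                   # S442: riders rent/park, tr = 0
            actions = [ag.act(states[i]) for i, ag in enumerate(agents)]  # S443
            env.step_dispatch(t, actions)        # S444: implement strategies, tr = 1
            next_states, mean_acts, rewards = env.observe(t + 1)          # S445
            for i in range(len(agents)):
                pool.append((states[i], actions[i], rewards[i],
                             next_states[i], mean_acts[i]))
            states = next_states
        if pool:                                 # S446: sample and update networks
            batch = random.sample(list(pool), min(256, len(pool)))
            for ag in agents:
                ag.update(batch, gamma)
        if episode % episode_upnet == 0:
            for ag in agents:
                ag.soft_update_targets(omega)    # copy weights to the target networks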
Further, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: updating and calculating the shared bicycle path flow of each OD label variable (η2, η3), the actual shared bicycle trip generation variable of unit η5, the actual shared bicycle trip attraction variable of unit η5, and the supply when the policy execution state variable tr = 0.
In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: updating and calculating the action parameter at, the variable giving the number of shared bicycles that each dispatching vehicle picks up at ηi,0 and places upon reaching ηi,1, and the supply when the policy execution state variable tr = 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.
Each dispatching vehicle is associated with a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state of the dispatching vehicle and whose output is its scheduling strategy; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step variable and whose output is the scheduling strategy of the next time step variable.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a parameterized neural network whose inputs are the state of the dispatching vehicle, its scheduling strategy and the mean action, and whose output is the Q-value function Qi; here the Q-value function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a parameterized neural network whose inputs are the state of the dispatching vehicle at the next time step variable, the scheduling strategy of the next time step variable and the mean action of the next time step variable, and whose output is the target Q-value function.
In the Q-Learning method, the policy model of a dispatching vehicle selects its action by probability sampling: the selection probabilities are computed from the Q-value function Qi evaluated with the state of the dispatching vehicle and the mean action of the other dispatching vehicles at time step t-1, where ine denotes the label variable of a dispatching vehicle other than dispatching vehicle i, ωd denotes the policy parameter, and Ai denotes the action space set of the dispatching vehicle. The mean action is then updated and substituted back into the sampling formula, an action is drawn according to the resulting probabilities, and this action is taken as the action finally selected by the policy model of the dispatching vehicle.
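A minimal sketch of such probability sampling over Q values, under the assumption of a discrete action set and a Boltzmann (softmax) form with a temperature-like policy parameter, might read:

import numpy as np

# Illustrative Boltzmann (softmax) action selection over Q values, as used in
# mean-field Q-learning style methods; beta plays the role of the policy
# parameter controlling exploration (an assumption for illustration).

def boltzmann_sample(q_values: np.ndarray, beta: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    logits = beta * q_values
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

# q_values[a] would be Q_i(state, a, mean action of the other vehicles at the
# previous time step), evaluated for every action a in the action set A_i.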
If the reinforcement learning algorithm adopts the policy gradient method, the tuple formed by the global state, the scheduling strategies, the reward values of the dispatching vehicles, the mean actions and the global state of the next time step is stored in the experience pool, and a batch of samples is randomly drawn from the experience pool. According to the sampled global states st,j, the sampled global states of the next time step st+1,j, the sampled strategies, the sampled mean actions, the sampled reward values of the dispatching vehicles, the cumulative reward discount factor γ and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episodeupnet training rounds, the parameters θi of the policy model neural network and the parameters of the value model neural network are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and of the value target network, respectively. Here st denotes the global state and st+1 the global state of the next time step.
If the reinforcement learning algorithm adopts the Q-Learning method, the corresponding tuple is stored in the experience pool and a batch of samples is again randomly drawn from the experience pool. According to the sampled data, the cumulative reward discount factor γ and the loss function, the neural network parameters of the value estimation network are updated; every Episodeupnet training rounds, the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the target network update weight coefficient ω.
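The periodic transfer of estimation-network parameters to the target networks with weight coefficient ω is commonly implemented as a soft (Polyak) update; the following sketch assumes that convention, which the patent text does not spell out.

import torch

# Illustrative soft update of a target network with weight coefficient omega:
# target <- omega * estimation + (1 - omega) * target.
# Whether a soft or a hard copy is intended is not stated in the text; the soft
# form is assumed here for illustration.

@torch.no_grad()
def soft_update(target_net, estimation_net, omega: float) -> None:
    for tp, ep in zip(target_net.parameters(), estimation_net.parameters()):
        tp.mul_(1.0 - omega).add_(omega * ep)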
The invention has the following beneficial effects:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method provided by the invention helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on large-scale road networks under random and complex dynamic environments. The method requires neither advance demand forecasting nor manual data processing, and is not limited by the computational efficiency or accuracy of demand prediction. Moreover, it is not an optimal strategy for each time period taken in isolation, but an overall optimization of the whole scheduling process that takes into account the supply and demand changes of future time periods and the influence of scheduling decisions on the supply and demand of subsequent periods.
(2) The dynamic scheduling optimization strategy provided by the invention improves scheduling operation efficiency: it increases the actual trip volume and the utilization rate of shared bicycles and reduces the loss of shared bicycle user demand; it lowers the idle rate of shared bicycles on the roads and reduces the excessive accumulation of idle vehicles in certain areas; and it reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large piles of idle vehicles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of bicycles in feeder (first/last-mile) trips and improves the operating efficiency of the public transport system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, which reduces urban congestion and motor vehicle exhaust emissions and increases social welfare.
Drawings
FIG. 1 is a flow chart of the shared bicycle scheduling method;
FIG. 2 is a coordinate diagram of the region units based on division into regular hexagons.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Before describing specific embodiments of the present invention, in order to make the solution of the invention clearer and more complete, the definitions of the abbreviations and key terms appearing in the invention are explained first:
OD traffic volume: the volume of travel between origin-destination pairs. "O" comes from the English word ORIGIN and refers to the starting point of a trip; "D" comes from DESTINATION and refers to the destination of a trip.
MFMARL algorithm: Mean Field Multi-Agent Reinforcement Learning, a multi-agent reinforcement learning algorithm based on mean field game theory.
As shown in FIG. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle dispatching optimization model for the shared bicycles according to the scheduling variables;
S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
In the embodiment of the invention, the interaction between the supply and demand environment and the implemented scheduling strategy is considered within a sequential decision problem, and a dynamic scheduling optimization problem for shared bicycles is formulated. According to the length of the scheduling optimization cycle and whether surplus shared bicycles may be placed into urban fixed warehouses, the scheduling optimization problem can be divided into two problems: the shared bicycle scheduling optimization problem without fixed warehouses and the shared bicycle scheduling optimization problem with fixed warehouses.
In the shared bicycle scheduling optimization problem, the optimization objective is not to maximize the actual trip volume within a single time period or to pursue high scheduling efficiency for a single dispatching vehicle, but to maximize the global trip volume over the whole scheduling cycle through cooperative dynamic scheduling strategy optimization. In addition, when urban warehouses exist, the scheduling strategy includes the action of placing redundant vehicles into a warehouse so as to reduce the number of redundant bicycles on the roads.
The invention constructs a scheduling optimization process for shared bicycles as shown in fig. 3. In the dynamic scheduling optimization process, the renting, riding, parking and scheduling of bicycles and the changes in supply and demand are considered. At each time step, each dispatching vehicle picks up a certain number of shared bicycles from its current unit and loads them into its cabin; the dispatching vehicle then drives to the arrival unit and places all the shared bicycles in its cabin in that unit.
In the embodiment of the present invention, as shown in FIG. 2, the specific method for dividing the scheduling area of the shared bicycles is: the scheduling area is divided into a plurality of regular hexagons serving as scheduling area units, and for each scheduling area unit a global label variable η5, a horizontal-direction label variable m and a vertical-direction label variable h are defined, which satisfy the following relational expressions:
η5(m,h) = (M+1)m + h
m ∈ {0,1,...,M}
h ∈ {0,1,...,M}
wherein η5 ∈ M′, M′ = {0,1,...,(M+1)²−1}, M denotes the maximum value of the horizontal-direction or vertical-direction label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units;
in step S1, the operating environment variables of the shared bicycles include time variables, an urban fixed-warehouse location set variable and supply parameters;
the time variables comprise a time step variable t, a time step set T and a maximum time step variable Tmax, where t ∈ T and T = {0,1,...,Tmax};
the urban fixed-warehouse location set variable comprises the fixed warehouse location set ηw.
Some units in the city may be set as urban fixed warehouses for scheduling; when scheduling measures are implemented, a dispatching vehicle may place idle shared bicycles into the urban fixed warehouse of such a unit. An urban fixed warehouse has no capacity limit, and bicycles placed into a warehouse by a dispatching vehicle are not transported out again or given to riders for use. When the unit at which a rider's trip ends is an urban fixed warehouse unit, the shared bicycle parked by the rider is not placed into the warehouse but remains in the unit and can still be used by riders in the future. The urban fixed-warehouse location set variable contains the locations of the units in which urban fixed warehouses are located within the whole region.
The supply parameters comprise a first supply coefficient cdis and a second supply coefficient cinitial. The first supply coefficient cdis is determined as follows: the demand value of each scheduling area unit at each time step is calculated from the shared bicycle demand data, and the 40th percentile of the demand values of all scheduling area units in every 10 minutes is taken as the first supply coefficient cdis. The second supply coefficient cinitial is determined as the ratio between the shared bicycle supply of each scheduling area unit and the first supply coefficient cdis.
It is assumed that at the initial time the shared bicycle supply of each unit in the area is evenly distributed. In order to give the study of the influence of supply on travel some generality, the invention does not specify the supply directly but determines the supply value from the relation between supply and demand. The invention defines the first supply coefficient cdis as the 40th percentile of the sequence of riding demand values of all units in every 10 min, where the demand value of each unit at each time step is calculated from the demand data. The 40th percentile is used instead of the mean because the mean is more easily affected by extreme values: the 40th percentile is the value located at the 40% position after all values are arranged from small to large. This avoids the problem that a small number of very high demands in the riding demand sequence of all units at each time step would make the analysis lose generality.
The invention defines the second supply coefficient cinitial as the ratio, at the initial time, between the shared bicycle supply of each unit and the first supply coefficient cdis. In the invention, the second supply coefficient cinitial is chosen from the values cinitial ∈ {20, 50, 100, 200, 500, 1000}. The shared bicycle supply of each unit at the initial time is defined as the product of cdis and cinitial, rounded down to an integer.
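A minimal sketch of this supply initialisation, assuming demand data aggregated into 10-minute time steps per unit, might read:

import math
import numpy as np

# Illustrative initialisation of the supply following the description above:
# c_dis is the 40th percentile of per-unit, per-10-minute demand values, and the
# initial supply of every unit is floor(c_dis * c_initial).

def first_supply_coefficient(demand: np.ndarray) -> float:
    """demand: array of shape (n_time_steps, n_units) with 10-minute demand values."""
    return float(np.percentile(demand, 40))

def initial_supply(demand: np.ndarray, c_initial: float) -> int:
    c_dis = first_supply_coefficient(demand)
    return math.floor(c_dis * c_initial)

# Example: for c_initial in {20, 50, 100, 200, 500, 1000}, each unit starts the
# scheduling cycle with initial_supply(demand, c_initial) shared bicycles.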
In the embodiment of the present invention, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling strategy variable class;
the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0 (indicating the number of shared bicycles available for use), and the shared bicycle supply variable of the scheduling area unit when tr = 1 (likewise indicating the number of shared bicycles available for use);
at time step t, the riding trip variable class comprises the global label η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global label η3 of the scheduling area unit where the OD destination of the trip is located, the OD label variable (η2, η3) of a shared bicycle trip, the OD flow of shared bicycle trips, the trip flow rate of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip generation variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
the conversion of η2 and η3 to horizontal and vertical labels follows the same label relation as for the scheduling area units. For a given origin unit η2, the trip flow rates to all destination units η3 sum to 1. When η2 = η5, the path flows with unit η2 as origin unit sum to the actual trip generation of η5; when η3 = η5, the path flows with unit η3 as destination unit sum to the actual trip attraction of η5;
at time step t, the scheduling strategy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the label variable ηi,0 of the unit from which a dispatching vehicle departs, the label variable ηi,1 of the unit at which a dispatching vehicle arrives, the set κ1 of moving-direction variables of a dispatching vehicle, the set κ2 of scheduling ratio variables, the moving-direction variable of a dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling strategy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable giving the number of shared bicycles that a dispatching vehicle picks up at ηi,0 and places at ηi,1, the ratio αwh of the number of shared bicycles placed at ηi,1 to the number of bicycles in the cabin when the dispatching vehicle arrives at ηi,1 and ηi,1 belongs to ηw, the predicted cumulative increase or decrease of the supply of unit η5 before a dispatching vehicle implements its scheduling strategy, under the expectation that the preceding dispatching vehicles implement theirs, the increased revenue obtained after a dispatching vehicle implements its scheduling strategy, and the total number Zwarehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling cycle;
where I = {0,1,...,N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0,1,...,5} and κ2 = {0, 0.25, 0.5, 0.75}; the moving-direction variable takes values in κ1 and the scheduling ratio variable takes values in κ2.
When the moving-direction variable of a dispatching vehicle takes the values 0 to 5, the dispatching vehicle moves to the adjacent unit to the lower left, to the right, to the upper left, to the left, to the lower right and to the upper right, respectively; the corresponding relations between the moving-direction variable and the horizontal and vertical labels of the arrival unit follow from the hexagonal layout of the units. In these relations, m denotes the horizontal-direction label variable of a scheduling area unit, h denotes the vertical-direction label variable of a scheduling area unit, M′ denotes the unit label set of the scheduling area units and T denotes the time step set.
The conversion of ηi,0 and ηi,1 to horizontal and vertical labels follows the same label relation as for the scheduling area units. The scheduling ratio variable of dispatching vehicle i can take four percentage values and indicates the percentage of the supply of unit ηi,0 that dispatching vehicle i picks up at that moment. When the number of shared bicycles expected to be cumulatively removed from unit η5 exceeds the number expected to be placed there, the predicted cumulative increase or decrease of its supply takes a negative value; conversely, when the number expected to be removed is less than or equal to the number placed, the predicted cumulative increase or decrease is non-negative.
In the embodiment of the present invention, in step S3, the vehicle dispatching optimization model of the shared bicycles consists of an objective function subject to a set of constraints; the objective and constraint expressions are given as equation images in the original publication, and their content is described below.
In the vehicle dispatching optimization model, the increased revenue obtained after the dispatching vehicles implement their scheduling strategies is maximized as the objective function of the short-term scheduling optimization problem of the shared bicycles over the whole scheduling horizon, where t denotes a time step, Tmax denotes the maximum time step variable, i denotes the dispatching vehicle label variable and N denotes the total number of dispatching vehicle label variables.
Compared with the case where no scheduling strategy is implemented, the invention sets the benefit as the maximized gain of the shared bicycle system. The decision variables are the action decisions of the dispatching vehicles, which comprise the moving direction of the dispatching vehicle and its scheduling ratio. When the policy execution state variable tr of time step variable t equals 0, the decision variable of each dispatching vehicle i ∈ I is its action decision, i.e. the pair formed by the moving-direction variable of the dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons and the scheduling ratio variable of the dispatching vehicle.
When the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a scheduling area unit coincides with the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the shared bicycle path flow of the OD label variable (η2, η3) is updated and calculated for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′, where INT(·) denotes rounding down to an integer, and the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit (the initial given supply at t = 0, or the supply after the scheduling strategy was implemented, tr = 1) and the trip flow rate of shared bicycles departing from η2 and arriving at η3 enter the calculation; M′ denotes the unit label set of the scheduling area units.
The actual trip volume of the scheduling area unit for the policy execution state tr = 0 is the smaller of the supply of the unit and the travel demand generated by the unit. The path flow is not greater than the integer part of the product of the actual trip volume of the referenced scheduling area unit η5 and the trip flow rate with η2 as origin and η3 as destination.
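A minimal sketch of this trip assignment step, under the assumption that the trips of an origin unit are split over destinations according to the given OD trip flow rates, might read:

import math

# Illustrative trip assignment for one origin unit when tr = 0: the actual trip
# generation is limited by the available supply, and trips are split over
# destinations according to the OD trip flow rates (which sum to 1).

def assign_trips(supply: int, demand: int, flow_rates: dict[int, float]) -> dict[int, int]:
    """Return the path flow to each destination unit from this origin unit."""
    actual_generation = min(supply, demand)
    return {dest: math.floor(actual_generation * rate)
            for dest, rate in flow_rates.items()}

# Example: supply = 12, demand = 20, flow_rates = {3: 0.5, 7: 0.3, 9: 0.2}
# -> actual generation 12, path flows {3: 6, 7: 3, 9: 2}.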
The trip flow rate of shared bicycles departing from η2 and arriving at η3 satisfies the conservation relation between path flow and OD flow: for the unit η2 where the OD origin of the shared bicycle trips is located, the trip flow rates of the departing trips sum to 1, for tr = 0, η2 ∈ M′, η3 ∈ M′, where T denotes the time step set and η3 denotes the global label of the unit where the OD destination of a shared bicycle trip is located.
According to the path flows, when the policy execution state tr = 0 at time step t and the global label variable η5 of a scheduling area unit coincides with the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the sum of the shared bicycle path flows is taken as the actual shared bicycle trip generation of the scheduling area unit with global label η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′.
When the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a scheduling area unit coincides with the global label η3 of the unit where the OD destination of a shared bicycle trip is located, the sum of the shared bicycle path flows of the OD label variables (η2, η3) is taken as the actual shared bicycle trip attraction of the scheduling area unit with global label η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′.
When the policy execution state variable tr of time step variable t equals 0, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' trip activities: the supply of each unit is obtained from the shared bicycle supply variable after the scheduling strategy was implemented at time step (t-1) (i.e. with tr = 1), decreased by the actual shared bicycle trip generation of η5 at time step t and increased by the actual shared bicycle trip attraction of η5 at time step t.
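Stated as code, and assuming the signs follow the verbal description (rented bicycles leave the unit, parked bicycles arrive), the rider-trip update of one unit's supply might look like:

# Illustrative supply update for one unit when tr = 0 (signs assumed from the
# verbal description: rented bicycles reduce the supply, parked bicycles add to it).

def supply_after_trips(supply_prev_tr1: int, generation: int, attraction: int) -> int:
    """supply_prev_tr1: supply after dispatching at time step t-1 (tr = 1);
    generation: actual shared bicycle trips generated by the unit at time step t;
    attraction: actual shared bicycle trips attracted by the unit at time step t."""
    return supply_prev_tr1 - generation + attraction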
When the policy execution state variable tr of time step variable t equals 0, the label variable of the unit that the dispatching vehicle reaches at time step (t+1) is calculated from the label variable of its current departure unit ηi,0 and its moving-direction variable towards one of the six adjacent regular hexagons, for tr = 0, i ∈ I, ηi,0 ∈ M′, ηi,1 ∈ M′, where m denotes the horizontal-direction label variable of a scheduling area unit and h denotes the vertical-direction label variable of a scheduling area unit; the departure-unit label variable of the dispatching vehicle at time step (t+1) is the arrival-unit label variable determined at time step t.
When the policy execution state variable tr of time step variable t equals 0, the predicted cumulative increase or decrease of the supply of unit η5 is calculated from the numbers of shared bicycles that the preceding dispatching vehicles are predicted to pick up from and to place into η5, where αwh denotes the ratio applied when a dispatching vehicle arrives at ηi,1 and ηi,1 belongs to the fixed warehouse location set ηw.
When the policy execution state tr = 0 at time step t, after the first (i-1) dispatching vehicles are predicted to have implemented their scheduling strategies, the predicted cumulative increase or decrease of the supply of unit η5 is obtained as follows. If the scheduling strategy that the (i-1)-th dispatching vehicle is predicted to implement involves unit η5 neither as pick-up unit nor as placement unit, the predicted cumulative increase or decrease remains unchanged, and it is 0 if no preceding dispatching vehicle involves η5. If the (i-1)-th dispatching vehicle is predicted to pick up a certain number of vehicles from unit η5, the predicted cumulative increase or decrease of the supply of η5 is reduced by that number. If the (i-1)-th dispatching vehicle is predicted to place a certain number of vehicles into unit η5 and the label of unit η5 does not belong to the urban fixed warehouse location set ηw, the predicted cumulative increase or decrease of the supply of η5 is increased by that number. If the (i-1)-th dispatching vehicle is predicted to place vehicles into unit η5 and the label of unit η5 belongs to the urban fixed warehouse location set ηw, the predicted cumulative increase or decrease of the supply of η5 is increased by the number determined with the ratio αwh. When the urban fixed warehouse location set ηw is empty, the case of urban fixed warehouses is by default not considered.
When the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up a certain number of shared bicycles from ηi,0, loads them into its cabin, and will place all the shared bicycles in its cabin at ηi,1. The number of vehicles picked up by the dispatching vehicle is calculated for tr = 0, i ∈ I, η5 ∈ M′ from the supply when the policy execution state variable tr = 0, the label variable ηi,0 of the departure unit of the dispatching vehicle, the maximum cabin capacity of the dispatching vehicle and its scheduling ratio variable, where min(·) denotes taking the minimum value.
The scheduling ratio prescribes the percentage of the current supply of unit ηi,0 that the dispatching vehicle requests to pick up according to its scheduling strategy. The remaining supply is the supply of ηi,0 in the time state tr = 0 after the preceding (i-1) dispatching vehicles are assumed to have performed their scheduling before time step t. The number of shared bicycles picked up is the minimum of the number to be picked up according to the scheduling strategy, the remaining supply and the cabin capacity, taken as an integer, and the whole expression is subject to a non-negativity constraint.
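As a sketch under the stated assumptions (requested number = scheduling ratio × remaining supply, capped by the remaining supply and the cabin capacity, and forced to be a non-negative integer):

import math

# Illustrative computation of the number of shared bicycles a dispatching
# vehicle picks up, following the verbal description above.

def pickup_count(remaining_supply: int, ratio: float, cabin_capacity: int) -> int:
    """ratio in {0, 0.25, 0.5, 0.75}: share of the unit's supply requested."""
    requested = math.floor(ratio * remaining_supply)
    return max(0, min(requested, remaining_supply, cabin_capacity))

# Example: remaining_supply = 30, ratio = 0.5, cabin_capacity = 12 -> 12 picked up.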
When the policy execution state variable tr = 1 at time step t, the scheduling strategy is executed according to the numbers of vehicles picked up by the dispatching vehicles, and the supply of unit η5 is updated to obtain the shared bicycle supply of η5 after the scheduling strategy has been implemented.
At time step t in the time state tr = 1, after dispatching vehicle i has performed its scheduling, the supply of unit η5 is obtained as follows. If the scheduling strategy implemented by dispatching vehicle i involves unit η5 neither as pick-up unit nor as placement unit, the supply of η5 remains unchanged. If dispatching vehicle i picks up a certain number of vehicles from unit η5, the supply of η5 is reduced by that number. If dispatching vehicle i places a certain number of vehicles into unit η5 and the label of unit η5 does not belong to the urban fixed warehouse location set ηw, the supply of η5 is increased by that number. Otherwise, when the label of unit η5 belongs to the urban fixed warehouse location set ηw, the supply of η5 is increased by the number of shared bicycles placed according to the ratio αwh, and the remaining shared bicycles in the cabin are by default placed into the urban fixed warehouse of unit η5.
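A hedged sketch of this dispatch-side update for a single unit, assuming the warehouse split described above (a fraction αwh of the cabin load is placed on the road and the remainder is stored in the warehouse), might read:

# Illustrative dispatch-side supply update for one unit when tr = 1.
# Assumption: when the arrival unit hosts an urban fixed warehouse, a fraction
# alpha_wh of the cabin load is placed on the road and the rest is stored in
# the warehouse (this split is described only verbally in the text above).

def supply_after_dispatch(supply, picked_here, placed_here, is_warehouse_unit,
                          alpha_wh, warehouse_stock):
    supply -= picked_here                       # bicycles picked up from this unit
    if is_warehouse_unit:
        on_road = int(alpha_wh * placed_here)   # part of the cabin goes on the road
        warehouse_stock += placed_here - on_road
        supply += on_road
    else:
        supply += placed_here                   # all placed bicycles stay on the road
    return supply, warehouse_stock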
The total number of shared bicycles Z_warehouse stored in the city fixed warehouses is calculated by accumulating, over the scheduling cycle, the bicycles deposited into warehouse units, i.e. Z_warehouse = Σ_{t∈T} Σ_{i∈I} (1 − α_wh) · u_{i,t} for those time steps at which η_{i,1} ∈ η_w.
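A minimal sketch of this supply update in Python; the argument names (unit labels, pick-up count, warehouse set, split ratio alpha_wh) are hypothetical stand-ins for the variables above:

def apply_dispatch(supply: dict, unit: int, start_unit: int, arrival_unit: int,
                   picked_up: int, warehouse_units: set, alpha_wh: float,
                   warehouse_total: int) -> tuple:
    """Update the supply of one unit after a dispatching vehicle executes its policy (tr = 1)."""
    q = supply[unit]
    if unit == start_unit:
        q -= picked_up                                   # bicycles removed from the starting unit
    elif unit == arrival_unit:
        if unit in warehouse_units:
            kept = int(alpha_wh * picked_up)
            q += kept                                    # part of the load goes back on the street
            warehouse_total += picked_up - kept          # the rest is stored in the city warehouse
        else:
            q += picked_up                               # whole load placed at the arrival unit
    supply[unit] = q
    return supply, warehouse_total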
The shared bicycle scheduling optimization problem rests on two assumptions. Assumption 1: each dispatching vehicle implements its scheduling policy in turn, in the order of its label. Riders rent shared bicycles according to the current supply in the current unit; the decision maker formulates a scheduling policy based on the supply-demand environment after those trips have finished, then implements the policy and updates the supply-demand environment. The policy execution state variable tr therefore splits every time step into two states: a state in which rider trips are updated and the scheduling policy is formulated, and a state in which the scheduling policy is implemented. When tr = 0, riders rent, use and park shared bicycles, the supply-demand change of each unit at time step t is updated, and a scheduling policy is generated based on the supply-demand situation after the trips; when tr = 1, the scheduling policy is implemented and the supply-demand environment is updated under its influence. Assumption 2: to ensure that a dispatching vehicle never travels outside the scheduling area, the invention assumes that the vehicle stays at its current location whenever this would otherwise occur. That is, when the scheduling policy would move the dispatching vehicle outside the area at time step t + 1, the policy is updated so that the unit η_{i,1} reached at time step t + 1 is the unit η_{i,0} in which the vehicle is located at time step t. Under this assumption, under policy a_{i,t}, the starting unit η_{i,0} and the arrival unit η_{i,1} must simultaneously satisfy η_{i,0} ∈ M′ and η_{i,1} ∈ M′, with tr = 0 and i ∈ I.
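A minimal sketch of this boundary rule on the hexagonal grid; the six direction offsets below are hypothetical (they depend on how the hexagons are laid out), and the point of the sketch is only the stay-in-place rule of Assumption 2:

def next_unit(m: int, h: int, direction: int, M: int) -> tuple:
    """Move one step on the hexagonal grid; stay in place if the move would leave the area."""
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]  # hypothetical layout
    dm, dh = offsets[direction]
    nm, nh = m + dm, h + dh
    if not (0 <= nm <= M and 0 <= nh <= M):
        return m, h          # Assumption 2: the dispatching vehicle stays at its current unit
    return nm, nh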
In the shared bicycle scheduling optimization problem, the length of the scheduling cycle is controlled by the setting of the time step set. For the short-term scheduling optimization problem of shared bicycles, the scheduling cycle is set to one day, i.e. Tmax = 143 and the time set is T = {0, 1, ..., 143}. In conventional scheduling methods the scheduling cycle is usually one day. In practice, however, because the number of dispatching vehicles is limited, the uneven distribution of shared bicycles and the resulting demand loss become more serious as time goes on. Especially in the later stages of operation, the distribution of shared bicycles is even more unbalanced, which makes formulating an effective policy more challenging. To address long-term operational scheduling, the invention also considers a long-cycle dynamic scheduling optimization problem for shared bicycles, whose scheduling cycle is defined as 7 days, i.e. Tmax = 1007 and the time set is T = {0, 1, ..., 1007}.
The invention further aims to reduce the number of shared bicycles left excessively idle on urban roads while still increasing ridership as much as possible. It therefore also formulates a dynamic scheduling optimization problem for shared bicycles that includes city warehouses. In this problem, the invention assumes that fixed warehouses exist in the city and that redundant bicycles can be stored in them. During a dispatch operation, the dispatching vehicle may move to a unit in which a warehouse is built and place part of the bicycles stored in its cabin into the city warehouse. When the city fixed-warehouse location set η_w is empty, the shared bicycle scheduling optimization problem does not consider city warehouses; that is, redundant shared bicycles cannot be deposited into a city fixed warehouse during the scheduling process. Conversely, when η_w is not empty, the shared bicycle scheduling optimization problem by default assumes that city fixed warehouses exist and are used to store excess idle shared bicycles.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0,1}, the dispatching vehicle label set I = {0, 1, ..., N}, the dispatching vehicle moving direction variable set κ1 = {0, 1, ..., 5}, the dispatch ratio variable set κ2 = {0, 0.25, 0.5, 0.75}, and the unit label set M′ = {0, 1, ..., ((M+1)² − 1)}. According to the variable definitions of the short-term shared bicycle scheduling optimization problem, and taking the implementation of the scheduling policy into account, the constraints on the actual trip production and actual attraction of each unit ensure conservation of the shared bicycle trip flows of every unit.
In the constructed short-term scheduling optimization problem for shared bicycles, the objective function is to maximize the total increase in shared bicycle trips in the area served by the dispatching vehicles, compared with the case in which no scheduling policy is implemented. The decision variables are the action decisions of the dispatching vehicles, namely the direction in which each vehicle moves to a unit and the number of bicycles it dispatches. The constraints are conservation of the total number of shared bicycles, consistency between riding path flows and riding OD flows, and non-negativity and integrality of the flows during scheduling. When the travel demand generated in a unit exceeds the shared bicycles available there, the excess demand is counted as lost demand.
In the embodiment of the present invention, step S4 includes the following sub-steps:
s41: determining elements of a shared bicycle dispatching frame based on a vehicle dispatching optimization model of the shared bicycle;
s42: determining average action by utilizing a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle dispatching frame;
s44: and constructing the shared bicycle dispatching frame according to the elements, the average action, the experience pool variable and the training round related variable of the shared bicycle dispatching frame based on an average field theory.
Based on the proposed shared bicycle scheduling optimization problem, the invention provides a multi-agent reinforcement learning shared bicycle dispatching framework based on mean field theory. Its aim is to let the agents learn the changing riding demand, adapt to a stochastic dynamic environment, and realize cooperative dynamic decision optimization that increases riding output.
In the embodiment of the invention, in step S41, the invention combines the shared bicycle transfer process model with a multi-agent reinforcement learning algorithm to construct the shared bicycle vehicle scheduling model. The invention defines I as the agent label set, equal to the label set of the dispatching vehicles, S as the state set, A_i as the action space of agent i, P as the transition probability function, R as the reward function and γ as the discount factor. The MDP-based reinforcement learning model then contains six elements: G = (I, S, A, P, R, γ), where i denotes a dispatching vehicle label variable, equivalent to the label variable of an agent in the reinforcement learning algorithm, and i ∈ I = {0, 1, ..., N}.
The elements of the shared bicycle dispatching framework include the state s_{i,t}, the behavior parameter a_t and the reward function, where s_{i,t}, i = 0, ..., N, denotes the state of dispatching vehicle i at time step variable t, and a_{i,t}, i = 0, ..., N, denotes the scheduling policy of dispatching vehicle i at time step variable t.

The reward function comprises the actually increased trip reward function r^{PA}_{i,t} of a dispatching vehicle, the average increased trip reward function r^{APA}_{i,t} of a dispatching vehicle, and the overall increased trip reward function r^{APTU}_{t} of the dispatching vehicles, with the following formulas:

r^{PA}_{i,t} = α_rw · [ (P^{w}_{η_{i,0},t} − P^{wo}_{η_{i,0},t}) + (P^{w}_{η_{i,1},t} − P^{wo}_{η_{i,1},t}) ],

r^{APA}_{i,t} = α_rw · [ (P^{w}_{η_{i,0},t} − P^{wo}_{η_{i,0},t}) / n_{η_{i,0},t} + (P^{w}_{η_{i,1},t} − P^{wo}_{η_{i,1},t}) / n_{η_{i,1},t} ],

r^{APTU}_{t} = α_rw · Σ_{η5∈M′} (P^{w}_{η5,t} − P^{wo}_{η5,t}),

where α_rw denotes the scaling coefficient of the reward function; P^{w}_{η_{i,0},t} and P^{w}_{η_{i,1},t} denote the actual trip production of units η_{i,0} and η_{i,1} when the scheduling policy is implemented, and P^{wo}_{η_{i,0},t} and P^{wo}_{η_{i,1},t} the corresponding actual trip production when no scheduling policy is implemented; n_{η_{i,0},t} and n_{η_{i,1},t} denote the number of dispatching vehicles inside units η_{i,0} and η_{i,1} at time step variable t; P^{w}_{η5,t} and P^{wo}_{η5,t} denote the actual trip production of unit η5 with and without the scheduling policy; N denotes the maximum value of the dispatching vehicle label variable; η5 denotes the global label of each scheduling area unit; and M′ denotes the unit label set of the scheduling area units.
In the state s_{i,t}, i = 0, ..., N denotes the state of the dispatching vehicle at time step variable t. The invention assumes that the state at time t contains the supply of the unit in which agent i is located and the position label of that unit, i.e. s_{i,t} = (q^{tr=0}_{η_{i,0},t}, η_{i,0}). The behavior parameter refers to the joint action formed by the scheduling policies of the dispatching vehicles at time t and satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, where A is the set of vectors formed by the action spaces A_i of the actions a_{i,t}. The action policy a_{i,t} of agent i is equal to the scheduling policy of dispatching vehicle i, i.e. a_{i,t} = (κ¹_{i,t}, κ²_{i,t}), and agent i refers to each dispatching vehicle label in the city. The reward r_{i,t} is the immediate evaluation given by the environment, during the interaction between agent i and the environment, of the state and of the action generated. The goal of agent i is to find the maximum reward r_{i,t}. The invention considers three forms of reward function.
The reward function is the immediate evaluation given by the environment, during the interaction between agent i and the environment, of the state and of the generated action; the goal of agent i is to obtain the maximum reward value. Based on the shared bicycle scheduling problem, the variables used in computing the reward function are defined as follows:

α_rw — reward function scaling coefficient, dimensionless;
P^{wo}_{η_{i,0},t} — actual trip production of unit η_{i,0} at time step t when no scheduling policy is implemented, dimensionless;
P^{wo}_{η_{i,1},t} — actual trip production of unit η_{i,1} at time step t when no scheduling policy is implemented, dimensionless;
n_{η_{i,0},t} — number of dispatching vehicles inside unit η_{i,0} at time step t, dimensionless;
n_{η_{i,1},t} — number of dispatching vehicles inside unit η_{i,1} at time step t, dimensionless.

In the shared bicycle scheduling problem, the invention considers reward functions of three forms and takes the reward r_{i,t} of agent i as one selectable reward function among them, i.e. r_{i,t} ∈ {r^{PA}_{i,t}, r^{APA}_{i,t}, r^{APTU}_{t}}.
(1) Increased trip amount reward function obtained by the agent: the invention defines an increased trip amount (PA) reward function obtained by an agent, referred to as the PA reward function and denoted r^{PA}_{i,t}. It represents the increase in shared bicycle trips obtained by each agent after performing an action. In the PA reward function, the reward in the units the agent has moved through is considered to be the reward earned by that agent. This setting may cause an agent to focus only on the scheduling of certain units.
(2) Average increased trip amount reward function obtained by the agent: the invention defines an average increased trip amount (APA) reward function obtained by an agent, referred to as the APA reward function and denoted r^{APA}_{i,t}. The APA reward function refers to the average increase in shared bicycle trips obtained by an agent after performing an action; r^{APA}_{i,t} is defined as the average increase in trip production of the units η_{i,0} and η_{i,1} through which the dispatching vehicle executes its scheduling policy.
(3) Global increased trip amount obtained by the agents: the invention defines a globally increased trip amount reward function obtained by the agents over all units (APTU reward function), denoted r^{APTU}_{t}. The APTU reward function refers to the total increase in shared bicycle trips over the whole area obtained by all agents after performing the joint action.
The state transition probability refers to the updating of each agent's state as the time step advances, based on the joint action performed by the agents and their interaction with the environment.
In the embodiment of the present invention, in step S42, the specific method for determining the mean action is as follows: the scheduling policy a_{i,t} of a dispatching vehicle is rewritten in one-hot encoded form and the mean action \bar{a}_{i,t} is obtained. The calculation formulas are

a^{one-hot}_{i,t} ∈ {0,1}^{ρ_dim},   \bar{a}_{i,t} = (1/N) · Σ_{i_ne ≠ i} a^{one-hot}_{i_ne,t},

where each component of a^{one-hot}_{i,t} is a variable equal to 0 or 1, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle label variable, a^{one-hot}_{i_ne,t} denotes the action policy of i_ne, and i_ne denotes the label variable of the dispatching vehicles other than dispatching vehicle i.

The invention rewrites, in one-hot encoded form, the action vector a_{i,t} containing the moving direction κ¹_{i,t} and the dispatch ratio κ²_{i,t}; \bar{a}_{i,t} is then the mean action seen by agent i when the actions of the remaining agents are taken into account.
In a conventional multi-agent deep reinforcement learning algorithm, the joint action a_t satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, so the dimension of a_t is (N + 1) · ρ_dim; it expands as the number of agents increases, which leads to problems such as greater network complexity, lower computational efficiency and weaker policy optimization in reinforcement learning. In the MFMARL algorithm, by contrast, the joint mean action \bar{a}_{i,t} has dimension ρ_dim, the individual action a_{i,t} has dimension ρ_dim, and the action pair (a_{i,t}, \bar{a}_{i,t}) used in place of the joint action has dimension 2 · ρ_dim. Processing the joint action in the mean field (MF) manner therefore keeps the dimension of the joint action under control and preserves computational efficiency. In particular, in complex simulation settings where the number of agents is typically large, adopting the joint mean action based on MF theory alleviates the dimension explosion of the joint action caused by the growing number of agents.
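A minimal Python sketch of the one-hot rewriting and the mean action, assuming the six moving directions and four dispatch ratios defined earlier (the names are hypothetical):

import numpy as np

DIRECTIONS = list(range(6))                 # kappa1 = {0, ..., 5}
RATIOS = [0.0, 0.25, 0.5, 0.75]             # kappa2
RHO_DIM = len(DIRECTIONS) + len(RATIOS)     # dimension of one scheduling policy

def one_hot_action(direction: int, ratio: float) -> np.ndarray:
    """One-hot encoding of one dispatching vehicle's scheduling policy (direction, ratio)."""
    vec = np.zeros(RHO_DIM)
    vec[direction] = 1.0
    vec[len(DIRECTIONS) + RATIOS.index(ratio)] = 1.0
    return vec

def mean_action(actions: list, i: int) -> np.ndarray:
    """Mean action seen by agent i: average of the other agents' one-hot actions."""
    others = [a for j, a in enumerate(actions) if j != i]
    return np.mean(others, axis=0) if others else np.zeros(RHO_DIM)

# e.g. three vehicles; the mean action for vehicle 0 averages vehicles 1 and 2
acts = [one_hot_action(0, 0.5), one_hot_action(3, 0.25), one_hot_action(5, 0.0)]
print(mean_action(acts, 0))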
In an embodiment of the present invention, in step S43, the experience pool variables of the shared bicycle dispatching framework include the experience pool D and the experience pool capacity; the training-round related variables include the number of training rounds Episode, the target-network update interval Episode_upnet (in training rounds), the target network update weight coefficient ω and the cumulative return discount factor γ.
In the embodiment of the present invention, step S44 includes the following sub-steps:

S441: initialize the experience pool D; set the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient α_rw, the cumulative return discount factor γ, the initially given supply q^{tr=0}_{η5,0}, the shared bicycle travel demand variable d_{η5,t} and the shared bicycle trip flow f_{η2,η3,t} departing from η2 and arriving at η3; then cyclically perform steps S442–S445 according to the number of training rounds and the update interval Episode_upnet;

S442: update the shared bicycle operating environment when the policy execution state variable tr is 0;

S443: update the state s_{i,t} and the scheduling policy a_{i,t} of each dispatching vehicle;

S444: update the shared bicycle operating environment when the policy execution state variable tr is 1;

S445: update the state s_{i,t+1} of each dispatching vehicle at the next time step and the mean action \bar{a}_{i,t}, and update, according to the reward function, the increased revenue r_{i,t} of each dispatching vehicle after implementing its scheduling policy;

S446: based on the update process of steps S442–S445, construct the shared bicycle dispatching framework with a reinforcement learning algorithm, and complete shared bicycle dispatching with the framework.
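A compact Python sketch of this loop, assuming hypothetical env and agent objects exposing the operations named in S441–S446; it illustrates only the control flow, not the full framework:

def train(env, agents, episodes: int, t_max: int, update_interval: int):
    """Skeleton of the S441-S446 training loop of the dispatching framework."""
    for episode in range(episodes):
        states = env.reset()                                        # S441: initial supply, demand, flows
        for t in range(t_max + 1):
            env.step_riders(t)                                      # S442: rider trips, tr = 0
            actions = [ag.act(s) for ag, s in zip(agents, states)]  # S443: scheduling policies
            env.step_dispatch(actions, t)                           # S444: implement policies, tr = 1
            next_states, rewards = env.observe(t + 1)               # S445: next states and rewards
            for i, ag in enumerate(agents):                         # store transitions, update networks
                ag.remember(states[i], actions[i], rewards[i], next_states[i])
                ag.learn()                                          # S446: update estimation networks
            states = next_states
        if episode % update_interval == 0:                          # periodic target-network update
            for ag in agents:
                ag.update_target_networks()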
Improving the stability of the multi-agent environment: during training in a multi-agent environment, the policy of agent i is constantly changing. For agent i, the continually changing policies of the other agents place it in a non-stationary environment, and the transition probability from the state at one moment to the state at the next can no longer be guaranteed to be a stable value. That is, for any pair of joint policies of the other agents π^{t1}_{−i} ≠ π^{t2}_{−i}, a non-stationary environment may yield

P(s′ | s, a_i, π^{t1}_{−i}) ≠ P(s′ | s, a_i, π^{t2}_{−i}),

where π^{t1}_{−i} and π^{t2}_{−i} denote the policies of the agents other than i at times t1 and t2, and P(· | ·, ·, π^{t1}_{−i}) and P(· | ·, ·, π^{t2}_{−i}) denote the corresponding state transition probabilities faced by agent i at times t1 and t2.

If agent i learns the action contents of all agents during reinforcement learning, the environment it faces can be turned into a stationary one. Since a policy can be expressed through actions and states, when the states and actions of all agents are known, the state transition probabilities of agent i at times t1 and t2 satisfy

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N, π^{t1}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N),
P(s′ | s, a_i, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N).

Therefore P(s′ | s, a_i, π^{t1}_{−i}) and P(s′ | s, a_i, π^{t2}_{−i}) can be regarded as policy-independent. Even while the agents' policies keep changing, the transition probability from a state at one moment to the state at the next retains its stationarity, i.e.

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_i, π^{t2}_{−i}).

Hence, with the joint action known, for arbitrary π^{t1}_{−i} ≠ π^{t2}_{−i} the environment of agent i is improved to a stationary environment, as expressed by

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_i, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N).
In this embodiment of the present invention, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 0 is as follows: update and calculate the shared bicycle path flow f_{η2,η3,t} for each trip OD label variable (η2, η3), the actual shared bicycle trip production P_{η5,t} of unit η5, the actual shared bicycle attraction A_{η5,t} of unit η5, and the supply q^{tr=0}_{η5,t} when the policy execution state variable tr is 0.

In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 1 is as follows: update and calculate, from the behavior parameter a_t, the number u_{i,t} of shared bicycles each dispatching vehicle picks up from η_{i,0} and places on arrival at η_{i,1}, and the supply q^{tr=1}_{η5,t} when the policy execution state variable tr is 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.

Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network: the policy estimation network is a neural network whose input is the state s_{i,t} of the dispatching vehicle and whose output is the scheduling policy a_{i,t}; the policy target network is a neural network whose input is the state s_{i,t+1} of the dispatching vehicle at the next time step variable and whose output is the scheduling policy a_{i,t+1} of the next time step variable.

In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network with parameters φ_i; its inputs are the state s_{i,t}, the scheduling policy a_{i,t} and the mean action \bar{a}_{i,t} of the dispatching vehicle, and its output is the Q-value function Q_i, where the Q-value function refers to the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a neural network with parameters φ′_i; its inputs are the state s_{i,t+1}, the scheduling policy a_{i,t+1} and the mean action \bar{a}_{i,t+1} of the next time step variable, and its output is the target Q-value function Q′_i. The estimation and target networks of the value model compute Q_i and Q′_i by forward propagation and update their parameters by backpropagation, in the same way as the policy model.
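A minimal PyTorch-style sketch of these two networks, assuming hypothetical dimensions (state_dim, rho_dim) and the mean action defined earlier; it illustrates only the input/output wiring, not the patented architecture:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy estimation network: state -> scheduling policy logits (direction + ratio)."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, rho_dim))
    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Value estimation network: (state, action, mean action) -> Q value."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 2 * rho_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, state, action, mean_action):
        x = torch.cat([state, action, mean_action], dim=-1)
        return self.net(x)

# The target networks share the architecture and start from the estimation networks' weights.
policy, policy_target = PolicyNet(8, 10), PolicyNet(8, 10)
policy_target.load_state_dict(policy.state_dict())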
In the Q-Learning method, the policy model of the dispatching vehicle samples and selects the action a_{i,t} according to the probabilities

π_i(a_{i,t} | s_{i,t}, \bar{a}_{i,t−1}) = exp(ω_d · Q_i(s_{i,t}, a_{i,t}, \bar{a}_{i,t−1})) / Σ_{a′∈A_i} exp(ω_d · Q_i(s_{i,t}, a′, \bar{a}_{i,t−1})),

where \bar{a}_{i,t−1} denotes the mean action of the dispatching vehicles i_ne at time step t − 1, i_ne denotes the label variable of the dispatching vehicles other than dispatching vehicle i, ω_d denotes the policy parameter, Q_i(·) denotes the Q-value function, π_i(·) denotes the function computing the action probabilities, and A_i denotes the action space set of a_{i,t}. The mean action \bar{a}_{i,t} is then updated from the sampled actions, substituted back into the formula above, and the action is re-sampled according to π_i(a_{i,t} | s_{i,t}, \bar{a}_{i,t}); this re-sampled action is taken as the final action selected by the policy model of the dispatching vehicle.
For the Q-Learning-based reinforcement learning algorithm, the policy model of each agent obtains the action a_{i,t} from the Q_i value of agent i and does not include a policy target model. The value model of each agent is divided into a value estimation network and a value target network that is structurally identical to the value estimation network. The value estimation network takes as input the global state s_t, the action of the previous time step and the mean action \bar{a}_{i,t−1} of the previous time step, and outputs the Q_i value. The input layer of the value target network takes the global state s_{t+1} of the next moment, the action a_{i,t+1} and the mean action \bar{a}_{i,t+1}, and outputs the Q′_i value. The estimation and target networks of the value model compute Q_i and Q′_i by forward propagation and update their parameters by backpropagation.
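A minimal Python sketch of this softmax (Boltzmann-style) action sampling, assuming a callable q_func(state, action, mean_action) and a coefficient beta as a stand-in for the policy parameter above:

import numpy as np

def boltzmann_probs(q_func, state, actions, mean_action, beta: float) -> np.ndarray:
    """Softmax distribution over a discrete action set, driven by the agent's Q values."""
    q = np.array([q_func(state, a, mean_action) for a in actions])
    q -= q.max()                                  # numerical stabilisation
    p = np.exp(beta * q)
    return p / p.sum()

def sample_action(q_func, state, actions, mean_action, beta, rng):
    idx = rng.choice(len(actions), p=boltzmann_probs(q_func, state, actions, mean_action, beta))
    return actions[idx]

In the two-stage scheme described above, sample_action would be called once with the previous step's mean action, the mean action would be recomputed from all agents' sampled actions, and it would then be called a second time to produce each agent's final action.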
In the framework, each agent independently contains a policy model for decentralized action execution and a value model containing centralized state-action information. The reinforcement learning algorithm can be of two types, namely the policy gradient method and the Q-Learning method.
If the reinforcement learning algorithm adopts the policy gradient method, the transition (s_t, a_{i,t}, \bar{a}_{i,t}, r_{i,t}, s_{t+1}) is stored in the experience pool D, and a batch of samples (s_{t,j}, a_{i,t,j}, \bar{a}_{i,t,j}, r_{i,t,j}, s_{t+1,j}) is randomly drawn from D. According to these samples, the cumulative return discount factor γ and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, according to the target network update weight coefficient ω, the policy model network parameters θ_i and the value model network parameters φ_i are transferred to the corresponding policy target network parameters θ′_i and value target network parameters φ′_i. Here s_t denotes the global state, s_{t+1} the global state of the next time step, r_{i,t} the reward value of the dispatching vehicle, \bar{a}_{i,t} the mean action, s_{t,j} the global state of a sampled sample, s_{t+1,j} the next-step global state of a sampled sample, a_{i,t,j} the policy of a sampled sample, \bar{a}_{i,t,j} the mean action of a sampled sample, and r_{i,t,j} the reward value of the dispatching vehicle in a sampled sample.
If the reinforcement learning algorithm adopts the Q-Learning method, the transition (s_t, a_{i,t}, \bar{a}_{i,t}, r_{i,t}, s_{t+1}) is likewise stored in the experience pool D, and a batch of samples (s_{t,j}, a_{i,t,j}, \bar{a}_{i,t,j}, r_{i,t,j}, s_{t+1,j}) is again randomly drawn from D. For each sample, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, according to the target network update weight coefficient ω, the neural network parameters of the value model are transferred to the neural network parameters of the value target network.
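A small Python sketch of the replay-and-target-update step shared by both variants, assuming a deque-based pool and PyTorch-style networks exposing .parameters(); the soft-update weight omega corresponds to the target network update weight coefficient above:

import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience pool storing (s, a, mean_a, r, s_next) transitions."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)
    def store(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def soft_update(target_net, estimation_net, omega: float):
    """Blend estimation-network weights into the target network with weight omega."""
    for tp, ep in zip(target_net.parameters(), estimation_net.parameters()):
        tp.data.copy_(omega * ep.data + (1.0 - omega) * tp.data)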
In the embodiment of the invention, based on the proposed shared bicycle scheduling optimization problem, a shared bicycle dispatching framework based on mean field multi-agent reinforcement learning is proposed. Its aim is to let the agents learn the changing riding demand, adapt to a stochastic dynamic environment, and realize cooperative dynamic decision optimization that increases riding trips.
The basic idea of framework construction is as follows:
(1) feasibility for solving shared bicycle scheduling problem by using reinforcement learning algorithm framework
In the constructed shared bicycle scheduling optimization problem, the state of the supply amount at the current time is known and no historical information from earlier times is required. That is, the supply state at the current moment depends only on the supply state at the previous moment and the policy action executed, and is independent of the supply states and decision actions at other moments. The supply state over time in the shared bicycle scheduling optimization problem can therefore be regarded as Markovian: the problem satisfies the Markov assumption that the current state contains all relevant information.
Thus, the shared bicycle scheduling problem can be translated into a Markov decision process. The Markov decision process can be solved through the reinforcement learning framework, so that the reinforcement learning framework is feasible to solve the scheduling optimization problem of the shared bicycles. The reinforcement learning does not need label data, and the self-learning of the high-dimensional mapping relation from the state without the model to the action can be realized. In view of the advantages of the reinforcement learning algorithm, the invention provides a shared bicycle dispatching optimization framework based on the reinforcement learning algorithm.
(2) Method for solving shared bicycle scheduling problem based on reinforced algorithm framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, several difficulties arise, such as reduced learning effectiveness and repeated scheduling of vehicles within the same area, as described below.
First, when a multi-agent setting is built on a conventional reinforcement learning algorithm, each agent's policy keeps changing, which produces a non-stationary environment. The non-stationary environment violates the stationarity of Markov state transitions, causes policy estimation errors, and reduces the efficiency of policy optimization or prevents it altogether.
In the DQN algorithm, agent i learns and selects the best policy through an independent Q-learning procedure. In a multi-agent environment, however, agent i updates its policy independently during learning and does not take the other agents' policies into account; that is, the environment that any agent i faces is non-stationary, and at different times t the state transition probability is not necessarily a stable value. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to have a certain stationarity, and a non-stationary environment contradicts this assumption. Secondly, agent i in the conventional DQN algorithm randomly samples data from the experience pool and uses the drawn samples as a training batch for the neural network, which avoids the state-correlation problem in the raw data. In a multi-agent environment, however, a policy that agent i optimized in the current state may be an ineffective policy in the next state of the non-stationary environment. Learning from such invalid policy samples makes the single-agent DQN algorithm an inefficient learning process and increases the likelihood that experience replay fails. Therefore, in a multi-agent environment, the conventional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a policy gradient algorithm operating in a non-stationary environment, the variance produced by the algorithm grows as the number of agents increases, and the probability that the policy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected by the environment and other agents. During the learning process, the dimensionality of the joint action of the state-action value function will expand exponentially as the number of agents increases. Each agent estimates its own value function according to the joint strategy, and when the joint action space is large, the learning efficiency and the learning effect are reduced.
Thirdly, when the estimation network of the reinforcement learning algorithm is not very sensitive to its input data, an agent influenced by a large reward or penalty value learns parameter weights tied to that particular state and action. This further desensitizes the agent to changes in state and action information and leads it to re-select the same action. For a single agent, overly uniform action selection increases the monotony of the learning samples and degrades the fitting accuracy of the neural network in the estimation network. Inaccurate estimation of the dispatching vehicle policy by the reinforcement learning algorithm then reduces dispatching efficiency.
Fourthly, in a multi-agent reinforcement learning algorithm the joint action may send several dispatching vehicles to move shared bicycles within the same unit. Such non-cooperative scheduling policies lead to scheduling inefficiency, and over-scheduling can even leave some units with no bicycles available to borrow or with an excessive accumulation of shared bicycles.
(3) Basic idea of framework construction
According to the existing problems, for the shared bicycle dispatching optimization framework based on multi-agent deep reinforcement learning, the problem of dimension explosion caused by the number of agents in a stable environment needs to be considered, and the problem of efficient cooperative cooperation among the multi-agents is solved. The frame construction idea main points are as follows:
first, the intelligent agent learning structure of the framework:
In distributed-structure multi-agent deep reinforcement learning, methods can be divided into group reinforcement learning and independent reinforcement learning according to whether an agent considers the state and action information of the other agents.

If each agent in a multi-agent system can be regarded as an independent single agent without communication capability, i.e. the agent does not consider the policy choices of other agents when selecting its own policy, the algorithm is called an independent-learning algorithm. In this case, agents can obtain shared information only through collective communication after feedback from the external environment. By contrast, in a group-learning algorithm the agents are treated as a joint group, and each agent also considers the policy choices of the other agents during learning.

The independent-learning approach avoids the dimension explosion in communication caused by a growing number of agents and can borrow reinforcement learning algorithms designed for a stationary environment, but it converges slowly and requires long learning times. The group-learning approach allows full communication and therefore full cooperation among agents, but its search space is large and learning also takes long. In order to realize cooperation and communication among the agents, and to take the other agents' policies into account during learning, a group reinforcement learning approach is adopted here to construct the multi-agent deep reinforcement learning algorithm.
Second, the unstable environment of the frame improves:
for the problem of improving the unstable environment of the multi-agent, if the agent i learns the action contents of all agents in the learning process, the environment where the agent i is located can be changed into the stable environment. The state and action content of the agent is set herein to known information to improve the unstable environment.
Third, the dimension explosion problem in the framework caused by the increased number of agents improves:
Mean field game theory (MFT) studies differential games among group objects composed of rational players. While an agent considers its own state, it still takes the states of the remaining agents into account. A classic illustration of the mean field game is the coordinated movement of a school of fish: an individual fish does not attend to the swimming behavior of every other fish in the school, but adjusts its own behavior according to the behavior of the fish in its neighborhood. Mean field game theory describes the behavioral response of surrounding agents and the behavior set of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The mean field multi-agent reinforcement learning (MFMARL) algorithm based on mean field game theory assumes that the influence of all other agents on a given agent can be represented by a mean distribution. MFMARL is suitable for reinforcement learning problems with large numbers of agents and simplifies the interaction computation among agents; it resolves the expansion of the value function space caused by the growth in the number of agents. MFMARL is therefore introduced into the shared bicycle dispatching framework here, and each agent is defined to have the same discrete action space.
Fourthly, the sensitivity of the reinforcement learning algorithm to the change of the state and the action information is improved in the framework:
In order to improve learning stability in the multi-agent deep reinforcement learning framework and to increase the sensitivity of the neural networks to changes in state and action information, the framework uses one-hot encoding as the neural network input and processes the reward function value with the hyperbolic tangent tanh(·) function after scaling it.
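A one-line Python sketch of this reward shaping, assuming the scaling coefficient alpha_rw defined earlier:

import math

def shaped_reward(raw_increase: float, alpha_rw: float) -> float:
    """Scale the increased trip amount, then squash it with tanh to keep rewards bounded."""
    return math.tanh(alpha_rw * raw_increase)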
Fifth, the framework improves the efficient collaborative ability between agents:
in order to improve the efficient cooperative ability among the agents, the framework designs that the agent i can learn the action contents of all agents in the learning process, namely, the states and strategies of other agents. In addition, different forms of reward functions in the framework are discussed herein, whose impact on collaborative ability is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a group-learned, mean-field-theory-based multi-agent reinforcement learning framework for shared bicycle dispatching.
The working principle and the process of the invention are as follows: the invention aims to establish a general frame of multi-agent deep reinforcement learning shared bicycle scheduling based on an average field theory so as to solve the problem of shared bicycle scheduling in a long-term scheduling process, a dynamic environment and a large-scale network. The method considers the stability of the state conversion of the multi-agent deep reinforcement learning algorithm, dimension explosion, the communication efficiency of the agents and the exploration behavior of the agents. A frame of a reinforcement learning algorithm is adopted, a coordinated and effective dynamic strategy is obtained in a high-dimensional action space, so that the travel requirement is met, and idle shared bicycles in a road are reduced. And defining the division of the area units by combining a reinforcement learning basic theory and the research of a shared bicycle dispatching system, and constructing a shared bicycle dispatching optimization model.
Aiming at the high-dimensional multi-agent action space, a shared bicycle dispatching framework based on mean field multi-agent deep reinforcement learning is provided. The proposed framework can be used to address long-term scheduling, dynamic environments, and large-scale, complex networks. The framework neither requires demand to be predicted in advance nor extra data processing, and is not affected by the computational efficiency or accuracy of demand prediction. Moreover, the framework does not seek the optimal policy for each time segment in isolation, but the overall optimization of the whole scheduling process, taking into account the supply-demand changes of future time segments and the influence of the current scheduling decision on the supply and demand of the next time segment.
The invention has the beneficial effects that:
(1) the shared bicycle dispatching optimization method based on reinforcement learning provided by the invention is beneficial to intelligently solving the short-term and long-term dispatching optimization problem of the shared bicycles of a large-scale road network under random and complex dynamic environments. The method does not need to predict the demand in advance or carry out manual data processing, and is not influenced by the calculation efficiency and accuracy of demand prediction. And the method is not an optimal strategy for each time segment, but is an overall optimization method of the whole scheduling process, which considers the supply and demand change of the future time segment and the influence of the scheduling decision on the supply and demand of the next time segment.
(2) The dynamic optimization scheduling policy provided by the invention improves scheduling operation efficiency. It increases the carrying capacity and utilization rate of shared bicycles and reduces the loss of shared bicycle user demand. It lowers the idle rate of shared bicycles on the roads and reduces the number of idle vehicles excessively accumulated in certain areas. It thereby reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large numbers of stacked idle vehicles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of bicycles in feeder traffic and improves the operating efficiency of the public transport system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (10)

1. A shared bicycle scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
s1: dividing a dispatching area of the shared bicycles to obtain dispatching area units, and determining running environment variables of the shared bicycles;
s2: determining a dispatching variable of the shared bicycle according to the running environment variable of the shared bicycle based on the dispatching area unit;
s3: constructing a bicycle dispatching optimization model of the shared bicycles according to the dispatching variables of the shared bicycles;
s4: the bicycle dispatching method comprises the steps of constructing a shared bicycle dispatching frame by utilizing an average field theory based on a vehicle dispatching optimization model of shared bicycles, and completing shared bicycle dispatching by utilizing the shared bicycle dispatching frame.
2. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S1 the specific method for dividing the dispatching area of the shared bicycles is as follows: the dispatching area of the shared bicycles is divided into a plurality of identical equilateral hexagons serving as dispatching area units, and for each dispatching area unit a global label variable η5, a horizontal direction label variable m and a vertical direction label variable h are defined, which satisfy a relation putting η5 in one-to-one correspondence with the pair (m, h), wherein η5 ∈ M′, M′ = {0, 1, ..., ((M+1)² − 1)}, M denotes the maximum value of the horizontal direction label variable or the vertical direction label variable of a dispatching area unit, and M′ denotes the unit label set of the dispatching area units;

in step S1, the operating environment variables of the shared bicycles include a time variable class and a city fixed-warehouse location set variable class;

the time variable class comprises a time step variable t, a time step variable set T and a maximum value variable Tmax of the time step, wherein t ∈ T and T = {0, 1, ..., Tmax};

the city fixed-warehouse location set variable class comprises the fixed warehouse location set η_w.
3. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S2 the dispatching variables of the shared bicycles comprise a policy execution state variable class, a supply-demand environment variable class, a riding trip variable class and a scheduling policy variable class;

the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};

at time step t, the supply-demand environment variable class comprises the shared bicycle travel demand variable d_{η5,t} of a dispatching area unit, the shared bicycle supply variable q^{tr=0}_{η5,t} of a dispatching area unit when the policy execution state variable tr is 0, and the shared bicycle supply variable q^{tr=1}_{η5,t} of a dispatching area unit when the policy execution state variable tr is 1;

at time step t, the riding trip variable class comprises the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip and the global label η3 of the dispatching area unit containing the OD destination of a shared bicycle trip, the OD label variable (η2, η3) of a shared bicycle trip and the OD flow of shared bicycle trips, the trip flow f_{η2,η3,t} of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip production variable P_{η5,t} of η5, and the actual shared bicycle attraction variable A_{η5,t} of η5;

at time step t, the scheduling policy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the starting unit label variable η_{i,0} of a dispatching vehicle, the arrival unit label variable η_{i,1} of a dispatching vehicle, the moving direction variable set κ1 of a dispatching vehicle, the dispatch ratio variable set κ2, the moving direction variable κ¹_{i,t} of a dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons, the dispatch ratio variable κ²_{i,t} of a dispatching vehicle, the scheduling policy a_{i,t} of a dispatching vehicle, the maximum cabin capacity C^max of a dispatching vehicle, the variable u_{i,t} of the number of shared bicycles a dispatching vehicle picks up from η_{i,0} and places at η_{i,1}, the ratio α_wh of the number of shared bicycles placed at η_{i,1} to the number of bicycles in the cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, the predicted cumulative increase/decrease Δ_{η5,t} of the supply of η5 before a dispatching vehicle implements its scheduling policy, under the assumption that the preceding dispatching vehicles have implemented theirs, the increased revenue r_{i,t} after a dispatching vehicle implements its scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;

wherein I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0, 1, ..., 5}, κ2 = {0, 0.25, 0.5, 0.75}, and a_{i,t} = (κ¹_{i,t}, κ²_{i,t}).
4. The method for dispatching bicycles based on deep reinforcement learning of claim 1, wherein in step S3 the vehicle dispatching optimization model of the shared bicycles is specifically the maximization of the objective function below subject to the constraints set out in the following paragraphs.

In the vehicle dispatching optimization model, the increased revenue r_{i,t} generated after the dispatching vehicles implement their scheduling policies is maximized as the objective function of the short-term scheduling optimization problem of shared bicycles, calculated as

max Σ_{t=0}^{Tmax} Σ_{i=0}^{N} r_{i,t},

wherein t denotes a time step, Tmax denotes the maximum value variable of the time step, i denotes the dispatching vehicle label variable, N denotes the maximum value of the dispatching vehicle label variable, and a_{i,t} denotes the scheduling policy of a dispatching vehicle;
when the policy execution state variable tr of time step variable t is 0, the action decision a_{i,t} is calculated as a_{i,t} = (κ¹_{i,t}, κ²_{i,t}), wherein κ¹_{i,t} denotes the moving direction variable of the dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons and κ²_{i,t} denotes the dispatch ratio variable of the dispatching vehicle;
when the policy execution state variable tr of time step variable t is 0 and the global label variable η5 of a dispatching area unit is the same as the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip, the shared bicycle path flow f_{η2,η3,t} for the OD label variable (η2, η3) of a shared bicycle trip is calculated, with the downward rounding operator INT(·), from the shared bicycle travel demand variable d_{η2,t} of the dispatching area unit, the shared bicycle supply variable of the dispatching area unit (the initially given supply when t = 0, and the supply q^{tr=1}_{η2,t−1} when the policy execution state variable tr equals 1 thereafter) and the trip rate φ_{η2,η3,t} of shared bicycles departing from η2 and arriving at η3, with η2, η3 ∈ M′, wherein M′ denotes the unit label set of the dispatching area units;

the trip rates departing from the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip sum to 1, i.e. Σ_{η3∈M′} φ_{η2,η3,t} = 1, t ∈ T, wherein T denotes the set of time step variables and η3 denotes the global label of the unit containing the OD destination of a shared bicycle trip;
according to the path flow f_{η2,η3,t}, when the policy execution state tr is 0 at time step t and the global label variable η5 of a dispatching area unit is the same as the global label η2 of the unit containing the OD origin of a shared bicycle trip, the sum of the shared bicycle path flows is taken as the actual shared bicycle trip production P_{η5,t} of the dispatching area unit with global label variable η5, calculated as P_{η5,t} = Σ_{η3∈M′} f_{η5,η3,t};

when the policy execution state variable tr of time step variable t is 0 and the global label variable η5 of a dispatching area unit is the same as the global label η3 of the unit containing the OD destination of a shared bicycle trip, the sum of the shared bicycle path flows is taken as the actual shared bicycle attraction A_{η5,t} of the dispatching area unit with global label variable η5, calculated as A_{η5,t} = Σ_{η2∈M′} f_{η2,η5,t};
when the policy execution state variable tr of time step variable t is 0, the shared bicycle supply q^{tr=0}_{η5,t} is updated according to the number of shared bicycles rented and parked during the riders' trip activities, calculated as

q^{tr=0}_{η5,t} = q^{tr=1}_{η5,t−1} − P_{η5,t} + A_{η5,t},

wherein q^{tr=1}_{η5,t−1} denotes the shared bicycle supply variable with the policy execution state variable tr equal to 1 after the scheduling policy has been implemented at time step (t − 1), P_{η5,t} denotes the actual shared bicycle trip production variable of η5 at time step t, and A_{η5,t} denotes the actual shared bicycle attraction variable of η5 at time step t;
when the policy execution state variable tr of time step variable t is 0, the label variable η_{i,1} of the unit the dispatching vehicle will reach at time step (t + 1) is calculated from the horizontal direction label variable m and the vertical direction label variable h of the dispatching area unit, the starting unit label variable η_{i,0} of the dispatching vehicle at time step (t + 1), and the moving direction variable κ¹_{i,t} of the dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons;
when the policy execution state variable tr of time step variable t is 0, the predicted cumulative increase/decrease Δ_{η5,t} of the supply of η5 is calculated by accumulating, over the preceding dispatching vehicles, the number of shared bicycles each of them is predicted to pick up from or place at η5, wherein the number of shared bicycles the (i − 1)-th dispatching vehicle is predicted to pick up from η5 is counted negatively, α_wh denotes the ratio of the number of shared bicycles placed at η_{i,1} to the number of bicycles in the cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, and η_w denotes the fixed warehouse location set;
when the policy execution state variable tr is 0 at time step t, the dispatching vehicle picks up (formula FDA0003142305900000061) shared bicycles from ηi,0 into its compartment and will place all (formula FDA0003142305900000062) shared bicycles onto ηi,1; the number of bicycles picked up by the dispatching vehicle (formula FDA0003142305900000063) is calculated by formula FDA0003142305900000064 together with formula FDA0003142305900000065, where min(·) denotes taking the minimum value, formula FDA0003142305900000066 denotes the supply amount when the policy execution state variable tr is 0, ηi,0 denotes the starting unit label variable of the dispatching vehicle, formula FDA0003142305900000067 denotes the maximum capacity of the compartment of the dispatching vehicle, and formula FDA0003142305900000068 denotes the dispatch ratio variable of the dispatching vehicle;
when the policy execution state variable tr is 1 at time step t, the scheduling policy is executed according to the number of bicycles picked up by the dispatching vehicle (formula FDA0003142305900000069), and η5 is updated to obtain the shared bicycle supply variable of η5 after the scheduling policy is implemented (formula FDA00031423059000000610), calculated by formula FDA00031423059000000611; the total number of shared bicycles Zwarehouse stored in the fixed warehouses of the city is calculated by formula FDA00031423059000000612.
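The environment-update rules above can be summarized in a short illustrative sketch (Python). The array values, the linear supply balance and the pickup rule min(supply × dispatch ratio, compartment capacity) are simplified assumptions used only for illustration; the exact relations are those given by the referenced formulas.

```python
import numpy as np

def update_environment(supply_prev, generation, attraction):
    # supply_prev: supply per unit after the previous step's scheduling (tr = 1)
    # generation:  trips starting in each unit at this step (bicycles rented)
    # attraction:  trips ending in each unit at this step (bicycles parked)
    supply = supply_prev - generation + attraction
    return np.maximum(supply, 0.0)  # supply cannot become negative

def pickup_amount(unit_supply, compartment_capacity, dispatch_ratio):
    # Bicycles a dispatching vehicle picks up from its starting unit:
    # bounded by a share of the unit's supply and by the compartment capacity.
    return int(min(unit_supply * dispatch_ratio, compartment_capacity))

if __name__ == "__main__":
    supply_prev = np.array([30.0, 12.0, 5.0])   # hypothetical supply per unit
    generation = np.array([8.0, 2.0, 6.0])      # trip generation per unit
    attraction = np.array([3.0, 9.0, 1.0])      # trip attraction per unit
    supply = update_environment(supply_prev, generation, attraction)
    print(supply)                                # [25. 19.  0.]
    print(pickup_amount(supply[0], compartment_capacity=20, dispatch_ratio=0.5))  # 12
```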
5. The deep reinforcement learning-based shared bicycle scheduling method according to claim 1, wherein the step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycle;
S42: determining the average action using one-hot encoding;
S43: defining the experience pool variables and the training round related variables of the shared bicycle scheduling framework;
S44: constructing the shared bicycle scheduling framework, based on mean field theory, from the elements of the framework, the average action, the experience pool variables and the training round related variables.
6. The deep reinforcement learning-based shared bicycle scheduling method of claim 5, wherein in the step S41, the elements of the shared bicycle scheduling framework comprise the state (formula FDA00031423059000000613), the behavior parameter a_t and the reward function, where formula FDA0003142305900000071 represents the state of the dispatching vehicle at time step variable t and formula FDA0003142305900000072 represents the scheduling policy of the dispatching vehicle at time step variable t;
the reward function comprises the actually increased trip reward function of the dispatching vehicle (formula FDA0003142305900000073), the average increased trip reward function of the dispatching vehicle (formula FDA0003142305900000074) and the overall increased trip reward function of the dispatching vehicle (formula FDA0003142305900000075), given by formulas FDA0003142305900000076, FDA0003142305900000077 and FDA0003142305900000078 respectively,
where αrw represents the scaling coefficient of the reward function, formula FDA0003142305900000079 denotes the actual trip generation amount of (formula FDA00031423059000000710) when the scheduling policy is implemented, formula FDA00031423059000000711 denotes the actual trip generation amount of (formula FDA00031423059000000712) when the scheduling policy is not implemented, formula FDA00031423059000000713 denotes the actual trip generation amount of (formula FDA00031423059000000714) when the scheduling policy is implemented, formula FDA00031423059000000715 denotes the actual trip generation amount of (formula FDA00031423059000000716) when the scheduling policy is not implemented, formula FDA00031423059000000717 denotes the number of dispatching vehicles within (formula FDA00031423059000000718) at time step variable t, formula FDA00031423059000000719 denotes the number of dispatching vehicles within (formula FDA00031423059000000720) at time step variable t, formula FDA00031423059000000721 denotes the actual trip generation amount of η5 when the scheduling policy is implemented, formula FDA00031423059000000722 denotes the actual trip generation amount of η5 when the scheduling policy is not implemented, N represents the maximum value of the dispatching vehicle label variable, η5 represents the global label of each scheduling region unit, and M' represents the unit label set of the scheduling region units.
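As a rough illustration of how these three reward terms might be computed, the following sketch (Python) compares trip amounts with and without the scheduling policy. The scaling coefficient value and the region groupings are hypothetical, and the exact reward expressions are those of the referenced formulas.

```python
import numpy as np

ALPHA_RW = 0.1  # hypothetical value of the reward scaling coefficient alpha_rw

def actual_increase_reward(trips_with_policy, trips_without_policy):
    # Extra trips actually served in the unit the vehicle acted on.
    return ALPHA_RW * (trips_with_policy - trips_without_policy)

def average_increase_reward(region_trips_with, region_trips_without, n_vehicles_in_region):
    # Extra trips in the surrounding region, averaged over the vehicles acting there.
    gain = np.sum(region_trips_with) - np.sum(region_trips_without)
    return ALPHA_RW * gain / max(n_vehicles_in_region, 1)

def overall_increase_reward(all_trips_with, all_trips_without, n_vehicles):
    # Extra trips over every scheduling area unit, shared among all N vehicles.
    gain = np.sum(all_trips_with) - np.sum(all_trips_without)
    return ALPHA_RW * gain / max(n_vehicles, 1)
```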
7. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in the step S42, the specific method for determining the average action is as follows: the scheduling policy of the dispatching vehicle (formula FDA00031423059000000723) is rewritten using one-hot encoding to obtain the average action (formula FDA00031423059000000724), calculated by formulas FDA00031423059000000725 and FDA00031423059000000726 respectively, where formula FDA00031423059000000727 denotes a variable taking the value 0 or 1, pdim represents the dimension of the scheduling policy, N represents the maximum value of the dispatching vehicle label variable, formula FDA00031423059000000728 represents the action policy of ine, and ine denotes the label variable of the dispatching vehicles other than dispatching vehicle i.
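A minimal sketch of this one-hot mean-field encoding (Python), assuming a discrete action space whose size p_dim covers staying in place plus the six hexagonal moving directions; the dimension and the example actions are illustrative only.

```python
import numpy as np

def one_hot(action_index, p_dim):
    # Rewrite a discrete scheduling action as a one-hot vector of length p_dim.
    v = np.zeros(p_dim)
    v[action_index] = 1.0
    return v

def mean_action(other_actions, p_dim):
    # Mean-field average action: the mean of the one-hot encoded actions
    # of every dispatching vehicle other than vehicle i.
    if len(other_actions) == 0:
        return np.zeros(p_dim)
    return np.mean([one_hot(a, p_dim) for a in other_actions], axis=0)

# e.g. three other vehicles, action space of size 7 (stay + six hexagon directions)
print(mean_action([0, 3, 3], p_dim=7))
```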
8. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in the step S43, the experience pool variables of the shared bicycle scheduling framework comprise the experience pool (formula FDA00031423059000000729) and the experience pool capacity (formula FDA00031423059000000730); the training round related variables comprise the number of training rounds Episode, the target network update interval Episodeupnet (in training rounds), the target network update weight coefficient ω and the cumulative return discount factor γ.
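A minimal experience-pool sketch (Python); the transition tuple layout is an assumption based on the quantities named in claims 9 and 10, not a structure fixed by the patent text.

```python
import random
from collections import deque

class ExperiencePool:
    """Minimal replay buffer corresponding to the experience pool of claim 8."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # experience pool capacity

    def store(self, transition):
        # transition layout (assumed): (state, action, mean_action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # random mini-batch for updating the value estimation network
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```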
9. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein the step S44 comprises the following sub-steps:
S441: initializing the experience pool (formula FDA0003142305900000081), and setting the experience pool capacity (formula FDA0003142305900000082), the target network update weight coefficient ω, the reward function scaling coefficient αrw, the cumulative return discount factor γ, the initial given supply amount (formula FDA0003142305900000083), the shared bicycle travel demand variable (formula FDA0003142305900000084) and the trip flow of shared bicycles departing from η2 and arriving at η3 (formula FDA0003142305900000085); and cyclically performing steps S442 to S445 according to the number of training rounds Episodeupnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr is 0;
S443: updating the state of each dispatching vehicle (formula FDA0003142305900000086) and its scheduling policy (formula FDA0003142305900000087);
S444: updating the shared bicycle operating environment when the policy execution state variable tr is 1;
S445: updating the state of each dispatching vehicle at the next time step (formula FDA0003142305900000088) and the average action (formula FDA0003142305900000089), and updating, according to the reward function, the increased revenue of the dispatching vehicle after the scheduling policy is implemented (formula FDA00031423059000000810);
S446: based on the update process of steps S442 to S445, constructing the shared bicycle scheduling framework with a reinforcement learning algorithm, and completing shared bicycle scheduling with the shared bicycle scheduling framework.
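The loop S441 to S446 can be outlined as the following training skeleton (Python). The env, agent and pool interfaces (reset, step_environment, apply_scheduling, observe, act, learn, soft_update_targets) are hypothetical names introduced only to show the control flow; they are not defined in the patent text.

```python
def train(env, agents, pool, episodes, episode_upnet, gamma, omega):
    # Skeleton of steps S442-S445 plus the learning step of S446.
    for episode in range(episodes):
        states = env.reset()                               # S441: initial supply, demand, trip flows
        for t in range(env.horizon):
            env.step_environment(tr=0)                     # S442: rider trips update the environment
            actions = [agent.act(s) for agent, s in zip(agents, states)]  # S443
            env.apply_scheduling(actions)                  # S444: tr = 1, scheduling policies executed
            next_states, rewards, mean_actions = env.observe()            # S445
            for i, agent in enumerate(agents):
                pool.store((states[i], actions[i], mean_actions[i],
                            rewards[i], next_states[i]))
            states = next_states
        for agent in agents:                               # S446: update estimation networks
            agent.learn(pool, gamma)
        if episode % episode_upnet == 0:                   # periodic target-network transfer
            for agent in agents:
                agent.soft_update_targets(omega)
```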
10. The deep reinforcement learning-based shared bicycle scheduling method of claim 5, wherein in the step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 0 is as follows: updating and calculating the OD label variables (η2, η3) of shared bicycle trips, the shared bicycle path flow (formula FDA00031423059000000811), the actual trip generation amount of shared bicycles in η5 (formula FDA00031423059000000812), the actual attraction amount of shared bicycles in η5 (formula FDA00031423059000000813) and the supply amount when the policy execution state variable tr is 0 (formula FDA00031423059000000817);
in the step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 1 is as follows: updating and calculating the behavior parameter a_t, the variable for the number of shared bicycles picked up by the dispatching vehicle from ηi,0 and placed on arrival at ηi,1 (formula FDA00031423059000000815), and the supply amount when the policy execution state variable tr is 1 (formula FDA00031423059000000816);
in the step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
each dispatching vehicle comprises a policy model and a value model; in the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network; the policy estimation network is constructed as a neural network whose parameters are θi, whose input is the state of the dispatching vehicle (formula FDA0003142305900000091) and whose output is the scheduling policy (formula FDA0003142305900000092); the policy target network is constructed as a neural network whose parameters are (formula FDA0003142305900000093), whose input is the state of the dispatching vehicle at the next time step variable (formula FDA0003142305900000094) and whose output is the scheduling policy of the next time step variable (formula FDA0003142305900000095);
in both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network; the value estimation network is constructed as a neural network whose parameters are (formula FDA0003142305900000096), whose inputs are the state of the dispatching vehicle (formula FDA0003142305900000097), the scheduling policy (formula FDA0003142305900000098) and the average action (formula FDA0003142305900000099), and whose output is the Q-value function Qi, where the Q-value function refers to the state-action value function in the reinforcement learning algorithm and represents the cumulative reward value obtained by the dispatching vehicle; the value target network is constructed as a neural network whose parameters are (formula FDA00031423059000000910), whose inputs are the state of the dispatching vehicle at the next time step variable (formula FDA00031423059000000911), the scheduling policy of the next time step variable (formula FDA00031423059000000912) and the average action of the next time step variable (formula FDA00031423059000000913), and whose output is the target Q-value function (formula FDA00031423059000000914);
in the Q-Learning method, the policy model of the dispatching vehicle obtains the action (formula FDA00031423059000000917) by probability sampling and selection according to formulas FDA00031423059000000915 and FDA00031423059000000916, where formula FDA00031423059000000918 denotes the average action of ine at time step t-1, ine denotes the label variable of the dispatching vehicles other than dispatching vehicle i, ωd represents the policy parameters, formula FDA00031423059000000919 is the functional form of Qi, formula FDA00031423059000000920 represents the action probability calculation function, and Ai represents the action space set of (formula FDA00031423059000000921); based on the action space set (formula FDA00031423059000000922), (formula FDA00031423059000000923) is updated and substituted into formula FDA00031423059000000924, (formula FDA00031423059000000926) is obtained by probability sampling according to (formula FDA00031423059000000925), and (formula FDA00031423059000000927) is taken as the final action selected by the policy model of the dispatching vehicle;
if the reinforcement learning algorithm adopts the policy gradient method, (formula FDA00031423059000000928) is stored into the experience pool (formula FDA00031423059000000929), a batch of samples (formula FDA00031423059000000931) is randomly sampled from the experience pool (formula FDA00031423059000000930), the neural network parameters of the value estimation network (formula FDA00031423059000000932) are updated according to the samples (formula FDA00031423059000000937), the cumulative return discount factor γ and the loss function, and the neural network parameters θi of the policy model are updated by gradient descent; every Episodeupnet training rounds, the policy model neural network parameters θi and the value model neural network parameters (formula FDA00031423059000000933) are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network (formula FDA00031423059000000934) and of the value target network (formula FDA00031423059000000935) respectively, where formula FDA00031423059000000936 denotes the global state, s_{t+1} denotes the global state of the next time step, r_t^i denotes the reward value of the dispatching vehicle, formula FDA0003142305900000101 denotes the average action, s_{t,j} denotes the global state in the sample, s_{t+1,j} denotes the global state of the next time step in the sample, formula FDA0003142305900000102 denotes the policy in the sample, formula FDA0003142305900000103 denotes the average action in the sample, and formula FDA0003142305900000104 denotes the reward value of the dispatching vehicle in the sample;
if the reinforcement learning algorithm adopts the Q-Learning method, (formula FDA0003142305900000105) is stored into the experience pool (formula FDA0003142305900000106), a batch of samples (formula FDA0003142305900000108) is again randomly sampled from the experience pool (formula FDA0003142305900000107), the neural network parameters of the value estimation network (formula FDA00031423059000001010) are updated according to the samples (formula FDA0003142305900000109), the cumulative return discount factor γ and the loss function, and every Episodeupnet training rounds the value model neural network parameters (formula FDA00031423059000001011) are transferred, according to the target network update weight coefficient ω, into the neural network parameters of the value target network (formula FDA00031423059000001012).
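Two of the mechanisms named above, the weighted transfer of estimation-network parameters into target-network parameters and the probability sampling of actions in the Q-Learning policy model, are sketched below (Python). The softmax temperature and the blending form ω·online + (1-ω)·target are common conventions assumed here for illustration; the patent's exact update and sampling formulas are those in the referenced figures.

```python
import numpy as np

def transfer_to_target(target_params, online_params, omega):
    # Blend estimation-network weights into target-network weights with coefficient omega.
    # Parameters are represented as dicts of NumPy arrays (an assumption).
    return {name: omega * online_params[name] + (1.0 - omega) * target_params[name]
            for name in target_params}

def sample_action(q_values, temperature=1.0):
    # Probability sampling over the action space, with probabilities
    # proportional to exp(Q / temperature) (softmax; assumed form).
    q = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(q - q.max())
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

# e.g. seven Q-values for the seven candidate moves of one dispatching vehicle
print(sample_action([0.2, 1.5, -0.3, 0.0, 0.9, 0.4, -1.0]))
```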
CN202110744265.2A 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning Active CN113326993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110421814 2021-04-20
CN2021104218142 2021-04-20

Publications (2)

Publication Number Publication Date
CN113326993A true CN113326993A (en) 2021-08-31
CN113326993B CN113326993B (en) 2023-06-09

Family

ID=77425362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110744265.2A Active CN113326993B (en) 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113326993B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN112068515A (en) * 2020-08-27 2020-12-11 宁波工程学院 Full-automatic parking lot scheduling method based on deep reinforcement learning
CN112417753A (en) * 2020-11-04 2021-02-26 中国科学技术大学 Urban public transport resource joint scheduling method
CN112348258A (en) * 2020-11-09 2021-02-09 合肥工业大学 Shared bicycle predictive scheduling method based on deep Q network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
IBRAHIM ALTHAMARY et al.: "A Survey on Multi-Agent Reinforcement Learning Methods for Vehicular Networks", 2019 15TH INTERNATIONAL WIRELESS COMMUNICATIONS & MOBILE COMPUTING CONFERENCE (IWCMC) *
VAN HASSELT et al.: "Deep Reinforcement Learning with Double Q-Learning", 30TH ASSOCIATION-FOR-THE-ADVANCEMENT-OF-ARTIFICIAL-INTELLIGENCE (AAAI) CONFERENCE ON ARTIFICIAL INTELLIGENCE *
樊瑞娜: "共享单车系统的平均场理论与闭排队网络研究" [Mean-field theory and closed queueing network research for bike-sharing systems], 《中国博士学位论文全文数据库 经济与管理科学辑》 *
涂雯雯等: "A Deep Learning Model for Traffic Flow State Classification Based on Smart Phone Sensor Data", 《ARXIV PREPRINT ARXIV》 *
陈佳惠等: "共享单车调度路径优化研究" [Research on route optimization of shared bicycle dispatching], 《交通科技与经济》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113997926A (en) * 2021-11-30 2022-02-01 江苏浩峰汽车附件有限公司 Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
CN115796399A (en) * 2023-02-06 2023-03-14 佰聆数据股份有限公司 Intelligent scheduling method, device and equipment based on electric power materials and storage medium
CN116307251A (en) * 2023-04-12 2023-06-23 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116307251B (en) * 2023-04-12 2023-09-19 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116402323A (en) * 2023-06-09 2023-07-07 华东交通大学 Taxi scheduling method
CN116402323B (en) * 2023-06-09 2023-09-01 华东交通大学 Taxi scheduling method
CN116824861A (en) * 2023-08-24 2023-09-29 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform
CN116824861B (en) * 2023-08-24 2023-12-05 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform

Also Published As

Publication number Publication date
CN113326993B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN113326993A (en) Shared bicycle scheduling method based on deep reinforcement learning
CN111862579B (en) Taxi scheduling method and system based on deep reinforcement learning
CN108417031B (en) Intelligent parking berth reservation strategy optimization method based on Agent simulation
CN113222463B (en) Data-driven neural network agent-assisted strip mine unmanned truck scheduling method
CN112738752A (en) WRSN multi-mobile charger optimized scheduling method based on reinforcement learning
CN116227773A (en) Distribution path optimization method based on ant colony algorithm
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
Xu et al. Designing van-based mobile battery swapping and rebalancing services for dockless ebike-sharing systems based on the dueling double deep Q-network
CN104537446A (en) Bilevel vehicle routing optimization method with fuzzy random time window
Kiaee Integration of electric vehicles in smart grid using deep reinforcement learning
CN117350424A (en) Economic dispatching and electric vehicle charging strategy combined optimization method in energy internet
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
CN117541026B (en) Intelligent logistics transport vehicle dispatching method and system
CN112750298B (en) Truck formation dynamic resource allocation method based on SMDP and DRL
CN114117910A (en) Electric vehicle charging guide strategy method based on layered deep reinforcement learning
CN117592701A (en) Scenic spot intelligent parking lot management method and system
Xu et al. Research on open-pit mine vehicle scheduling problem with approximate dynamic programming
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN117032298A (en) Unmanned aerial vehicle task allocation planning method under synchronous operation and cooperative distribution mode of truck unmanned aerial vehicle
CN115907066A (en) Cement enterprise vehicle scheduling method based on hybrid sparrow intelligent optimization algorithm
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle
CN114611864A (en) Garbage vehicle low-carbon scheduling method and system
Dziubany et al. Optimization of a cpss-based flexible transportation system
CN112561104A (en) Vehicle sharing service order dispatching method and system based on reinforcement learning
CN111652550A (en) Method, system and equipment for intelligently searching optimal loop set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant