CN113326993B - Shared bicycle scheduling method based on deep reinforcement learning - Google Patents

Shared bicycle scheduling method based on deep reinforcement learning

Info

Publication number
CN113326993B
CN113326993B (application number CN202110744265.2A; application publication CN113326993A)
Authority
CN
China
Prior art keywords
variable
scheduling
shared bicycle
eta
dispatching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110744265.2A
Other languages
Chinese (zh)
Other versions
CN113326993A (en)
Inventor
肖峰
涂雯雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwestern University Of Finance And Economics
Original Assignee
Southwestern University Of Finance And Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwestern University Of Finance And Economics filed Critical Southwestern University Of Finance And Economics
Publication of CN113326993A publication Critical patent/CN113326993A/en
Application granted granted Critical
Publication of CN113326993B publication Critical patent/CN113326993B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40Business processes related to the transportation industry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/04Constraint-based CAD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/08Probabilistic or stochastic CAD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/12Timing analysis or timing optimisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)

Abstract

The invention discloses a shared bicycle scheduling method based on deep reinforcement learning, which comprises the following steps: S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units and determining the operating environment variables of the shared bicycles; S2: determining the scheduling variables of the shared bicycles; S3: constructing a vehicle scheduling optimization model for the shared bicycles; S4: based on the vehicle scheduling optimization model, constructing a shared bicycle scheduling framework using mean field theory and completing shared bicycle scheduling with that framework. This reinforcement-learning-based scheduling optimization method helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under random and complex dynamic environments. The method accounts for future changes in supply and demand and for the interaction between scheduling decisions and the environment, requires neither advance demand prediction nor manual data processing, and is therefore not limited by the computational efficiency and accuracy of demand forecasting.

Description

Shared bicycle scheduling method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In the prior art, the bicycle scheduling optimization problem is usually solved by dividing the scheduling horizon into separate time periods and searching for the optimal scheduling strategy independently within each period. However, the scheduling policy of one period affects the supply and demand environment of the following periods. Such period-by-period, isolated policy optimization ignores both the supply and demand conditions of future periods and the influence of the policies already implemented. Under this approach, the strategy that is optimal within a single period does not necessarily lead to a higher actual trip volume later on, and may even reduce it. Consequently, isolated period-based policy optimization does not necessarily yield the globally optimal strategy over the full scheduling horizon.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem over a long scheduling horizon, in a dynamic environment, and on a large-scale network, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows. The shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to the scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
Further, in step S1, the specific method for dividing the scheduling area of the shared bicycles is as follows: the scheduling area is divided into a number of equilateral hexagons taken as scheduling area units, and for each scheduling area unit a global tag variable η_5, a horizontal direction tag variable m, and a vertical direction tag variable h are defined, which satisfy a fixed relation (given as a formula image in the original publication);
where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² - 1}, M denotes the maximum value of the horizontal or vertical direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
In step S1, the operating environment variables of the shared bicycles comprise time variables and a city fixed warehouse location set variable;
the time variables comprise the time step variable t, the time step variable set T, and the maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max};
the city fixed warehouse location set variable comprises the fixed warehouse location set η_w.
Further, in step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class, and a scheduling policy variable class;
the policy execution state variable class includes the policy execution state variable tr, where tr ∈ {0, 1};
at time step t, the supply and demand environment variable class includes the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 1 (the symbols for these variables are given as formula images in the original publication);
at time step t, the riding trip variable class comprises the global tag η_2 of the scheduling area unit where the OD starting point of a shared bicycle trip is located, the global tag η_3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η_2, η_3) of a shared bicycle trip, the OD flow of shared bicycle trips, the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, the actual travel variable of the shared bicycles generated in η_5, and the actual attraction variable of the shared bicycles in η_5;
at time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable η_i,0 of the dispatching vehicle, the arrival unit tag variable η_i,1 of the dispatching vehicle, the movement direction variable set κ_1 of the dispatching vehicle, the scheduling ratio variable set κ_2, the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons, the scheduling ratio variable of the dispatching vehicle, the scheduling policy of the dispatching vehicle, the maximum capacity of the cabin of the dispatching vehicle, the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, the ratio α_wh of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, the estimated cumulative increase variable of η_5 under the assumption that the scheduling policies of the preceding dispatching vehicles have already been applied before dispatching vehicle i applies its own policy, the benefit increased after the dispatching vehicle implements the scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouse at the end of the scheduling period;
where I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}.
Further, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycles;
S42: determining the average action using one-hot encoding;
S43: defining the experience pool variables and the training-round-related variables of the shared bicycle scheduling framework;
S44: based on mean field theory, constructing the shared bicycle scheduling framework from the elements, the average action, the experience pool variables, and the training-round-related variables.
Further, in step S41, the elements of the shared bicycle scheduling framework include the state, the behavior parameter a_t, and the reward function, where the state denotes the state of the dispatching vehicle at time step t and a_t denotes the scheduling policy of the dispatching vehicle at time step t.
The reward function comprises the reward function for the actually increased trip amount of the dispatching vehicle, the reward function for the average increased trip amount of the dispatching vehicle, and the global increased trip function of the dispatching vehicles (the specific formulas are given as images in the original publication);
where α_rw denotes the scaling factor of the reward function; the remaining symbols denote the actual travel amounts of the units η_i,1 and η_i,0 with and without the scheduling policy, the numbers of dispatching vehicles located within η_i,1 and within η_i,0 at time step t, and the actual travel amount of unit η_5 when the scheduling policy is and is not implemented; N denotes the maximum value of the dispatching vehicle tag variable, η_5 denotes the global tag of each scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
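The reward formulas themselves appear only as images in the original publication. The following Python sketch is purely illustrative and rests on assumptions: that the actual-increase reward is the α_rw-scaled difference between a unit's trips with and without the scheduling policy, that the average-increase reward divides that difference by the number of dispatching vehicles in the unit, and that the global-increase reward sums the differences over all units. The function and variable names are hypothetical.

def actual_increase_reward(alpha_rw, trips_with_policy, trips_without_policy, unit):
    """Reward from the actual increase in trips of the unit the vehicle serves (assumed form)."""
    return alpha_rw * (trips_with_policy[unit] - trips_without_policy[unit])

def average_increase_reward(alpha_rw, trips_with_policy, trips_without_policy,
                            unit, vehicles_in_unit):
    """Actual-increase reward shared among the dispatching vehicles in the unit (assumed form)."""
    return (actual_increase_reward(alpha_rw, trips_with_policy,
                                   trips_without_policy, unit)
            / max(vehicles_in_unit[unit], 1))

def global_increase_reward(alpha_rw, trips_with_policy, trips_without_policy, units):
    """Total increase in trips over all scheduling area units (assumed form)."""
    return alpha_rw * sum(trips_with_policy[u] - trips_without_policy[u] for u in units)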
Further, in step S42, the specific method for determining the average action is as follows: the scheduling policy of each dispatching vehicle is rewritten using one-hot encoding to obtain the average action (the calculation formulas are given as images in the original publication);
where the encoded entries are 0-1 variables, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle tag variable, the action policy of i_ne is encoded in the same way, and i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i.
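As a concrete illustration of the one-hot rewriting described above, the sketch below encodes each neighbouring dispatching vehicle's discrete policy (a direction in κ_1 combined with a ratio index in κ_2, which would give ρ_dim = 24 joint actions) as a 0-1 vector and averages the vectors of the other vehicles. The encoding order and the value of ρ_dim are assumptions made for illustration only.

import numpy as np

N_DIRECTIONS = 6                    # kappa_1 = {0, ..., 5}
N_RATIOS = 4                        # kappa_2 = {0, 0.25, 0.5, 0.75}
RHO_DIM = N_DIRECTIONS * N_RATIOS   # assumed dimension of the one-hot policy

def one_hot_policy(direction: int, ratio_index: int) -> np.ndarray:
    """Rewrite a (direction, ratio) scheduling policy as a 0/1 vector."""
    vec = np.zeros(RHO_DIM)
    vec[direction * N_RATIOS + ratio_index] = 1.0
    return vec

def mean_action(other_policies) -> np.ndarray:
    """Average the one-hot policies of the other dispatching vehicles i_ne."""
    encoded = [one_hot_policy(d, r) for d, r in other_policies]
    return np.mean(encoded, axis=0) if encoded else np.zeros(RHO_DIM)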
Further, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round-related variables comprise the number of training rounds Episode, the target-network update interval Episode_upnet, the weight coefficient ω for the target network update, and the cumulative return discount factor γ.
Further, step S44 includes the following sub-steps:
S441: initializing the experience pool, and setting the experience pool capacity, the weight coefficient ω for the target network update, the reward function scaling factor α_rw, the cumulative return discount factor γ, the initially given supply, the shared bicycle travel demand variable, and the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3; steps S442-S445 are then performed in a loop according to the number of training rounds Episode_upnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling policy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state and the average action of each dispatching vehicle, and updating the benefit increased after the dispatching vehicle implements the scheduling policy according to the reward function;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm, and completing shared bicycle scheduling with the scheduling framework.
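For readability, sub-steps S441-S446 can be summarised by the training-loop skeleton below. Every function name (env.update, agent.act, agent.learn, and so on) is a placeholder standing in for the environment and network updates described in the text; this is a sketch of the loop structure, not the patent's actual implementation.

from collections import deque

def train(env, agents, episodes, episode_upnet, gamma, omega, pool_capacity):
    replay = deque(maxlen=pool_capacity)          # S441: experience pool
    for episode in range(episodes):
        env.reset()                               # initial supply, demand, flow ratios
        for t in env.time_steps:
            env.update(tr=0)                      # S442: rider trips, supply/demand update
            states = [env.state(i) for i in agents]
            actions = [a.act(s) for a, s in zip(agents, states)]       # S443
            env.apply(actions)                    # S444: implement scheduling (tr=1)
            next_states = [env.state(i) for i in agents]
            mean_acts = [env.mean_action(i, actions) for i in agents]
            rewards = [env.reward(i) for i in agents]                  # S445
            replay.append((states, actions, mean_acts, rewards, next_states))
            for agent in agents:                  # S446: reinforcement learning update
                agent.learn(replay, gamma)
        if episode % episode_upnet == 0:
            for agent in agents:
                agent.soft_update_targets(omega)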
Further, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is as follows: updating and computing the shared bicycle path flow of the OD tag variable (η_2, η_3) of shared bicycle trips, the actual travel variable of the shared bicycles generated in η_5, the actual attraction variable of the shared bicycles in η_5, and the supply when the policy execution state variable tr = 0.
In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is as follows: according to the behavior parameter a_t, updating and computing the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, and the supply when the policy execution state variable tr = 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.
Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state of the dispatching vehicle and whose output is its scheduling policy; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step and whose output is the scheduling policy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatching vehicle, its scheduling policy, and the average action, and whose output is the Q-value function Q_i, where the Q function refers to the state-action value function in the reinforcement learning algorithm and represents the cumulative reward attained by the dispatching vehicle. The value target network is a neural network with its own parameters whose inputs are the state, the scheduling policy, and the average action of the dispatching vehicle at the next time step, and whose output is the target Q-value function.
In the Q-Learning method, the policy model of the dispatching vehicle selects an action by probability sampling based on the Q-value function and the average action of the previous time step (the formulas are given as images in the original publication), where the average action of time step t-1 refers to the other dispatching vehicles i_ne, i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, ω_d denotes the policy parameters, Q_i appears in its functional expression form, an action-probability calculation function is used, and A_i denotes the action space set. The average action is then updated according to the sampled actions and substituted back into the probability formula, an action is re-sampled according to the resulting probabilities, and the re-sampled action is taken as the final choice of the policy model of the dispatching vehicle.
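The action-probability formula itself is shown only as an image. A common realisation of "probability sampling over the Q values and the previous mean action", as used in mean-field Q-learning, is a Boltzmann (softmax) policy; the sketch below adopts that form purely as an assumption, and the q_network interface and temperature parameter are hypothetical.

import numpy as np

def sample_action(q_network, state, prev_mean_action, temperature=1.0):
    """Boltzmann sampling over the action space A_i (assumed form of the
    action-probability function; the patent's exact formula is an image)."""
    q_values = np.array([q_network(state, a, prev_mean_action)
                         for a in range(q_network.n_actions)])
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

As described above, the sampled actions of the other vehicles would then be averaged into an updated mean action, the probabilities recomputed, and a second sample taken as the final action.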
If the reinforcement learning algorithm adopts the policy gradient method, each transition (global state, scheduling policies, average actions, rewards, next global state) is stored in the experience pool, and a batch of samples is randomly drawn from the experience pool; according to the sampled transitions, the cumulative return discount factor γ, and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, the parameters θ_i of the policy-model neural network and the neural network parameters of the value model are transferred, according to the weight coefficient ω of the target network update, to the neural network parameters of the corresponding policy target network and value target network, respectively; here s_t denotes the global state, s_t+1 the global state of the next time step, s_t,j and s_t+1,j the global state and next-time-step global state of a sampled transition, and the remaining symbols (given as formula images in the original publication) denote the reward of the dispatching vehicle, the average action, and the policy, average action, and reward of a sampled transition.
If the reinforcement learning algorithm adopts the Q-Learning method, each transition is likewise stored in the experience pool and a batch of samples is randomly drawn from it; on the sampled transitions, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function, and every Episode_upnet training rounds the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the weight coefficient ω of the target network update.
The beneficial effects of the invention are as follows:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under random and complex dynamic environments. The method requires neither advance demand prediction nor manual data processing, and is therefore not limited by the computational efficiency and accuracy of demand forecasting. Moreover, instead of optimizing each time period in isolation, it is an overall optimization method for the entire scheduling process that takes into account the supply and demand changes of future periods and the influence of scheduling decisions on the supply and demand of subsequent periods.
(2) The dynamic optimized scheduling strategy provided by the invention improves scheduling efficiency: it increases the trip volume and utilization rate of shared bicycles, reduces the loss of user demand, lowers the idle rate of shared bicycles on the roads, and reduces the number of idle vehicles excessively accumulated in certain areas. This reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large piles of idle bicycles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of shared bicycles in feeder traffic and improves the operating efficiency of the public transportation system. Better shared bicycle service quality encourages replacing motor vehicle trips with shared bicycle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Drawings
FIG. 1 is a flow chart of the shared bicycle scheduling method;
FIG. 2 is a diagram of area units based on equilateral hexagonal division.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Before describing particular embodiments of the present invention, in order to make the description clearer and more complete, the abbreviations and key terms appearing in the invention are first defined:
OD traffic: the traffic volume between an origin and a destination. "O" comes from the English word ORIGIN and refers to the departure place of a trip; "D" comes from the English word DESTINATION and refers to the destination of a trip.
MFMARL algorithm: Mean Field Multi-Agent Reinforcement Learning, a multi-agent reinforcement learning algorithm based on mean field game theory.
As shown in FIG. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to the scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
In the embodiment of the invention, the sequential decision problem takes into account the interaction between the supply and demand environment and the implemented scheduling policy, and thereby addresses the dynamic scheduling optimization problem of shared bicycles. Depending on the length of the scheduling optimization period and on whether surplus shared bicycles may be placed into a city fixed warehouse, the scheduling optimization problem can be divided into two cases: the shared bicycle scheduling optimization problem without a fixed warehouse, and the one in which a fixed warehouse is considered.
In the shared bicycle scheduling optimization problem, the optimization target is neither to maximize the actual trip volume within a single time period nor to pursue the scheduling efficiency of a single dispatching vehicle, but to maximize the global trip volume over the whole scheduling period through cooperative dynamic scheduling policy optimization. Furthermore, once this goal is achieved, and when a city warehouse exists, the scheduling policy also includes the action of placing excess vehicles into the warehouse so as to reduce redundant bicycles on the roads.
The invention constructs the scheduling optimization process of the shared bicycles as shown in fig. 3. In the dynamic scheduling optimization process, the invention considers the renting, riding, and parking of bicycles, the scheduling process, and the changes in supply and demand. At each time step, each dispatching vehicle picks up a certain number of shared bicycles from the unit where it is currently located and loads them into its cabin; the dispatching vehicle then travels to its arrival unit and places all the shared bicycles in its cabin at that unit.
In the embodiment of the present invention, as shown in FIG. 2, the specific method for dividing the scheduling area is as follows: the scheduling area of the shared bicycles is divided into a number of equilateral hexagons taken as scheduling area units, and for each scheduling area unit a global tag variable η_5, a horizontal direction tag variable m, and a vertical direction tag variable h are defined, which satisfy a fixed relation (given as a formula image in the original publication), where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² - 1}, M denotes the maximum value of the horizontal or vertical direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
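The tag relation itself is given only as a formula image. One consistent reading, compatible with η_5 ∈ {0, ..., (M+1)² - 1}, is a row-major numbering η_5 = m + (M+1)·h with m, h ∈ {0, ..., M}; the sketch below uses that mapping purely as an assumption.

def global_tag(m: int, h: int, M: int) -> int:
    """Assumed row-major mapping from (horizontal, vertical) tags to the global tag."""
    assert 0 <= m <= M and 0 <= h <= M
    return m + (M + 1) * h          # eta_5 in {0, ..., (M+1)**2 - 1}

def horizontal_vertical(eta5: int, M: int) -> tuple:
    """Inverse mapping back to the horizontal and vertical direction tags."""
    return eta5 % (M + 1), eta5 // (M + 1)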
In step S1, the operating environment variables of the shared bicycles comprise time variables, a city fixed warehouse location set variable, and supply parameters.
The time variables comprise the time step variable t, the time step variable set T, and the maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max}.
The city fixed warehouse location set variable comprises the fixed warehouse location set η_w.
Some units in a city may be designated as city fixed warehouses. When scheduling measures are implemented, a dispatching vehicle may put idle shared bicycles into the city fixed warehouse of such a unit. The capacity of a city fixed warehouse has no upper limit, and bicycles delivered to the warehouse by dispatching vehicles are no longer taken out and given to riders. When the unit containing the destination of a rider's trip is a city fixed warehouse unit, the shared bicycle parked by the rider is not placed into the warehouse but remains in the unit and can still be used by riders in future time steps. The city fixed warehouse location set variable contains the unit locations of all city fixed warehouses within the area.
The supply parameters include a first supply coefficient c_dis and a second supply coefficient c_initial. The first supply coefficient c_dis is determined as follows: the demand value of each scheduling area unit in each time step is computed from the shared bicycle demand data, and the 40th percentile of the 10-minute demand values over all scheduling area units is taken as c_dis. The second supply coefficient c_initial is determined as the ratio of the shared bicycle supply within each scheduling area unit at the initial time to the first supply coefficient c_dis.
It is assumed that the shared bicycle supply of each unit is uniformly distributed over the area at the initial time. To keep the analysis of the influence of supply on trips general, the invention does not directly specify the supply quantity but determines the supply value from the relation between supply and demand. The 40th percentile is chosen rather than the mean because the mean is more easily affected by extreme values; the 40th percentile is the value at the 40% position after all values are arranged from small to large, which prevents the few very high demands in the riding demand sequence of all units from skewing the analysis.
The second supply parameter c_initial is the ratio of the shared bicycle supply of each unit at the initial time to the first supply coefficient c_dis. In the invention, c_initial is chosen from the set {20, 50, 100, 200, 500, 1000}. The shared bicycle supply of each unit at the initial time is therefore defined as c_dis times c_initial, rounded down to an integer.
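A minimal sketch of the two supply coefficients as described above: c_dis as the 40th percentile of the per-unit 10-minute demand values, and the initial per-unit supply as the rounded-down product c_dis times c_initial. The layout of the demand array is an assumption for illustration.

import numpy as np

def first_supply_coefficient(demand_per_unit_per_10min: np.ndarray) -> float:
    """c_dis: the 40th percentile of the 10-minute demand values of all units."""
    return float(np.percentile(demand_per_unit_per_10min, 40))

def initial_supply(c_dis: float, c_initial: int) -> int:
    """Initial shared bicycle supply of each unit, rounded down
    (c_initial is chosen from {20, 50, 100, 200, 500, 1000} in the patent)."""
    return int(c_dis * c_initial)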
In the embodiment of the invention, in step S2, the scheduling variables of the shared bicycles include the policy execution state variable class, the supply and demand environment variable class, the riding trip variable class, and the scheduling policy variable class.
The policy execution state variable class includes the policy execution state variable tr, where tr ∈ {0, 1}.
At time step t, the supply and demand environment variable class includes the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0 (representing the number of shared bicycles available), and the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 1 (also representing the number of shared bicycles available).
At time step t, the riding trip variable class comprises the global tag η_2 of the scheduling area unit where the OD starting point of a shared bicycle trip is located, the global tag η_3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η_2, η_3) of a shared bicycle trip, the OD flow of shared bicycle trips, the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, the actual travel variable of the shared bicycles generated in η_5, and the actual attraction variable of the shared bicycles in η_5.
η_2 and η_3 are converted to horizontal and vertical tags in the same way as the tag relation of the scheduling area unit. For a given starting unit η_2, the travel flow ratios over the destination units η_3 sum to 1. When η_2 = η_5, the shared bicycle path flows that take unit η_2 as starting unit sum to the actual travel amount; when η_3 = η_5, the shared bicycle path flows that take unit η_3 as destination unit sum to the actual attraction amount.
At time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable η_i,0 of the dispatching vehicle, the arrival unit tag variable η_i,1 of the dispatching vehicle, the movement direction variable set κ_1 of the dispatching vehicle, the scheduling ratio variable set κ_2, the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons, the scheduling ratio variable of the dispatching vehicle, the scheduling policy of the dispatching vehicle, the maximum capacity of the cabin of the dispatching vehicle, the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, the ratio α_wh of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, the estimated cumulative increase variable of η_5 under the assumption that the scheduling policies of the preceding dispatching vehicles have already been applied before dispatching vehicle i applies its own policy, the benefit increased after the dispatching vehicle implements the scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouse at the end of the scheduling period.
Here I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}. When the movement direction variable takes the values 0 to 5, the dispatching vehicle moves to the adjacent unit to the lower left, to the right, to the upper left, to the left, to the lower right, and to the upper right, respectively; the corresponding relations between the movement direction and the unit tags are given as formula images in the original publication, where m denotes the horizontal direction tag variable of a scheduling area unit, h denotes the vertical direction tag variable of the scheduling area unit, M′ denotes the unit tag set of the scheduling area units, and T denotes the time step variable set.
η_i,0 and η_i,1 are converted to horizontal and vertical tags in the same way as the tag relation of the scheduling area unit. The scheduling ratio variable of dispatching vehicle i may take one of the four percentages in κ_2; it denotes the percentage of the current supply of the pick-up unit η_i,0 that dispatching vehicle i is to pick up. When the number of shared bicycles expected to be cumulatively removed from unit η_5 is greater than the number expected to be placed there, the estimated cumulative increase amount is negative; conversely, when the number expected to be cumulatively removed is smaller than or equal to the number expected to be placed, the estimated cumulative increase amount is non-negative.
In the embodiment of the present invention, in step S3, the vehicle scheduling optimization model of the shared bicycles consists of an objective function and its constraints (the model formulas are given as images in the original publication and are explained below).
In the vehicle scheduling optimization model, the benefit increased after the dispatching vehicles implement the scheduling policy is taken as the objective function, to be maximized, of the short-term scheduling optimization problem of the shared bicycles; it is computed by summing the increased benefits over all time steps and all dispatching vehicles (the formula is given as an image in the original publication), where t denotes the time step, T_max denotes the maximum time step variable, i denotes the dispatching vehicle tag variable, N denotes the total number of dispatching vehicle tag variables, and the remaining symbol denotes the scheduling policy of the dispatching vehicle.
The invention defines the benefit as the increase in shared bicycle trips compared with the case without any scheduling policy. The decision variable is the action decision of the dispatching vehicle, which includes the movement direction of the dispatching vehicle and the scheduling ratio. When the policy execution state variable tr = 0 at time step t, the decision variable is the action decision, composed of the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons and the scheduling ratio variable of the dispatching vehicle.
When the policy execution state variable tr = 0 at time step t and the global tag variable η_5 of the scheduling area unit equals the global tag η_2 of the scheduling area unit where the OD starting point of the shared bicycle trip is located, the shared bicycle path flow of the OD tag variable (η_2, η_3) is computed according to the formula given as an image in the original publication, where INT(·) denotes rounding down, the demand symbol denotes the shared bicycle travel demand variable of the scheduling area unit, the supply symbols denote the shared bicycle supply variable of the scheduling area unit at the initially given supply (t = 0) and the supply variable when the policy execution state variable tr = 1, the ratio symbol denotes the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, and M′ denotes the unit tag set of the scheduling area units.
The actual travel amount of the scheduling area unit is the smaller of the supply of the unit when the policy execution state tr = 0 and the generated demand of the unit. The path flow is then no greater than the rounded-down product of the actual travel amount of unit η_5 and the travel flow ratio with η_2 as origin and η_3 as destination.
The travel flow ratio of shared bicycles departing from η_2 and arriving at η_3 satisfies the conservation relation between path flow and OD flow: for the global tag η_2 of the scheduling area unit where the OD starting point is located taken as origin, the travel flow ratios sum to 1 (the formula is given as an image in the original publication), where T denotes the time step variable set and η_3 denotes the global tag of the unit where the OD destination of the shared bicycle trip is located.
According to the path flow, when the policy execution state tr = 0 at time step t and the global tag variable η_5 of the scheduling area unit equals the global tag η_2 of the unit where the OD starting point of the shared bicycle trip is located, the sum of the shared bicycle path flows is taken as the actual travel amount of the shared bicycles of the scheduling area unit η_5 (the formula is given as an image in the original publication).
When the policy execution state variable tr = 0 at time step t and the global tag variable η_5 equals the global tag η_3 of the unit where the OD destination of the shared bicycle trip is located, the sum of the shared bicycle path flows of the OD tag variable (η_2, η_3) is taken as the actual attraction amount of the shared bicycles of the scheduling area unit η_5 (the formula is given as an image in the original publication).
When the policy execution state variable tr = 0 at time step t, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' travel activities: the supply of unit η_5 when tr = 0 equals the supply when tr = 1 after the scheduling policy was applied at time step (t-1), minus the actual travel amount of unit η_5 at time step t, plus the actual attraction amount of unit η_5 at time step t.
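The tr = 0 update described in this and the preceding paragraphs can be summarised by the sketch below: rentable trips are the smaller of supply and demand, path flows are rounded-down shares of those trips, and the supply is rolled forward by attraction minus departures. Variable names and the nested-dictionary layout are illustrative assumptions.

import math

def update_environment_tr0(supply, demand, flow_ratio, units):
    """One tr=0 step: rider trips, attractions, and the resulting supply (illustrative)."""
    available = {u: min(supply[u], demand[u]) for u in units}            # rentable trips
    path_flow = {(o, d): math.floor(available[o] * flow_ratio[o][d])     # OD path flows
                 for o in units for d in units}
    trips = {o: sum(path_flow[(o, d)] for d in units) for o in units}        # actual travel
    attraction = {d: sum(path_flow[(o, d)] for o in units) for d in units}   # actual attraction
    new_supply = {u: supply[u] - trips[u] + attraction[u] for u in units}
    return trips, attraction, new_supply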
When the policy execution state variable tr = 0 at time step t, the unit tag variable that the dispatching vehicle will arrive at in time step (t+1) is computed from the starting unit tag variable of the dispatching vehicle at time step (t+1) and the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons (the formulas are given as images in the original publication), where m denotes the horizontal direction tag variable of the scheduling area unit and h denotes the vertical direction tag variable of the scheduling area unit.
When the policy execution state variable tr = 0 at time step t, the estimated cumulative increase/decrease amount of the supply of η_5 is computed according to the formula given as an image in the original publication, where one symbol denotes the number of shared bicycles that the (i-1)-th dispatching vehicle is expected to pick up from η_5, α_wh denotes the ratio of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, and η_w denotes the fixed warehouse location set.
When the policy execution state tr = 0 at time step t, assuming the first (i-1) dispatching vehicles have implemented their scheduling, the estimated cumulative increase/decrease amount of the supply of unit η_5 behaves as follows. If the proposed scheduling policy of the (i-1)-th dispatching vehicle does not involve unit η_5, the estimated cumulative increase amount is 0. If the (i-1)-th dispatching vehicle is expected to pick up a given number of bicycles from unit η_5, the estimated cumulative increase amount of the supply of η_5 decreases by that number. If the (i-1)-th dispatching vehicle is expected to place a given number of bicycles into unit η_5 and the tag value of unit η_5 does not belong to the city fixed warehouse location set η_w, the estimated cumulative increase amount of the supply of η_5 increases by that number. If the (i-1)-th dispatching vehicle is expected to place bicycles into unit η_5 and the tag value of unit η_5 belongs to η_w, the estimated cumulative increase amount increases by the amount given in the original formula image, which involves the ratio α_wh because part of the placed bicycles enters the city fixed warehouse. When the city fixed warehouse location set η_w is empty, the case of the city fixed warehouse is disregarded by default.
When the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up a certain number of shared bicycles at η_i,0 and loads them into its cabin, and then places all of the shared bicycles in the cabin at η_i,1. The number of shared bicycles picked up by the dispatching vehicle is computed according to the formula given as an image in the original publication, where min(·) denotes taking the minimum, the supply symbol denotes the supply when the policy execution state variable tr = 0, η_i,0 denotes the starting unit tag variable of the dispatching vehicle, the capacity symbol denotes the maximum capacity of the cabin of the dispatching vehicle, and the ratio symbol denotes the scheduling ratio variable of the dispatching vehicle.
The scheduling ratio indicates that the dispatching vehicle is to pick up the given percentage of the current supply of its current unit η_i,0; the supply referred to is the remaining supply of η_i,0 after the first (i-1) dispatching vehicles have implemented their scheduling before time step t in state tr = 0. The number of picked-up shared bicycles is therefore the rounded-down number of bicycles that should be picked up according to the scheduling ratio, the remaining supply, and the cabin capacity, and it is an integer; the formula as a whole is also subject to a non-negativity constraint.
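A minimal sketch of the pick-up rule described above, assuming the picked-up number is the rounded-down scheduling-ratio share of the remaining supply, capped by the cabin capacity and kept non-negative; the function and parameter names are illustrative.

import math

def pickup_count(remaining_supply: int, schedule_ratio: float, cabin_capacity: int) -> int:
    """Number of shared bicycles picked up at the starting unit eta_{i,0}."""
    wanted = math.floor(schedule_ratio * remaining_supply)   # ratio of the current supply
    return max(0, min(wanted, cabin_capacity))               # capped, non-negative integer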
When the policy execution state variable tr = 1 at time step t, the scheduling policy is executed according to the number of shared bicycles picked up by the dispatching vehicle, and η_5 is updated to obtain the shared bicycle supply variable of η_5 after the scheduling policy has been implemented (the formula is given as an image in the original publication).
The formula expresses the supply of unit η_5 after dispatching vehicle i implements its scheduling at time step t in state tr = 1. If the scheduling policy of dispatching vehicle i does not involve unit η_5, the supply of η_5 remains unchanged. If dispatching vehicle i picks up a given number of bicycles from unit η_5, the supply of η_5 decreases by that number. If dispatching vehicle i places a given number of bicycles into unit η_5 and the tag value of unit η_5 does not belong to the city fixed warehouse location set η_w, the supply of η_5 increases by that number. Conversely, when the tag value of unit η_5 belongs to η_w, the supply of η_5 increases only by the portion determined by the ratio α_wh, and the remaining shared bicycles are placed into the city fixed warehouse of unit η_5 by default.
The total number Z_warehouse of shared bicycles stored in the city fixed warehouse is computed accordingly (the formula is given as an image in the original publication).
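The tr = 1 update and the warehouse bookkeeping described above can be sketched as follows. The split of placed bicycles between the unit supply and the warehouse via α_wh is written out under the stated assumption that α_wh is the fraction left in the unit, with the remainder counted toward Z_warehouse; the exact split in the patent's formula image may differ.

def apply_schedule(supply, warehouse_total, origin, destination, picked,
                   alpha_wh, warehouse_units):
    """Move `picked` bicycles from origin to destination (illustrative tr=1 update).

    Assumption: when the destination is a warehouse unit, a fraction alpha_wh of
    the cabin stays in the unit's supply and the rest enters the fixed warehouse.
    """
    supply[origin] -= picked
    if destination in warehouse_units:
        kept = int(alpha_wh * picked)
        supply[destination] += kept
        warehouse_total += picked - kept
    else:
        supply[destination] += picked
    return supply, warehouse_total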
The shared bicycle scheduling optimization problem rests on two assumptions. Assumption one: each dispatching vehicle implements its scheduling policy in turn, in the order of the dispatching vehicle numbers. Riders rent shared bicycles according to the current supply of shared bicycles in their current unit; the decision maker formulates a scheduling policy based on the supply and demand environment after the trips are completed, then implements the policy and updates the supply and demand environment. The policy execution state variable tr divides each time step into two states: a state in which rider trips are updated and the scheduling policy is formulated, and a state in which the scheduling policy is implemented. When tr = 0, riders rent, use and park shared bicycles, the supply and demand change of every unit at time step t is updated, and a scheduling policy is then generated based on the post-trip supply and demand; when tr = 1, the scheduling policy is implemented and the supply and demand environment under its influence is updated. Assumption two: to guarantee that a dispatching vehicle never travels outside the scheduling area, it is assumed that the vehicle stays at its current location whenever this would occur. That is, if the scheduling policy would send a dispatching vehicle to a unit outside the area at time step t+1, the policy is updated so that the unit the vehicle reaches at time step t+1 is the unit it occupies at time step t. Under this assumption, the arrival unit at time step t+1 and the current unit at time step t must simultaneously satisfy the corresponding boundary constraints of the scheduling policy.
In the shared bicycle scheduling optimization problem, the length of the scheduling cycle is controlled by the choice of the time step set. The invention sets the scheduling cycle of the short-term scheduling optimization problem to one day, i.e. T_max = 143 and the time step set T = {0, 1, ..., 143}. In conventional scheduling methods the scheduling cycle is typically one day. In practice, however, because the number of dispatching vehicles is limited, the uneven distribution of shared bicycles and the loss of demand become more serious as time goes on. Especially in the later stages of the operating process, formulating an effective policy is more challenging because the distribution of shared bicycles is more unbalanced. To address long-term operational scheduling, the invention also considers a long-cycle dynamic scheduling optimization problem for shared bicycles, whose scheduling cycle is defined as seven days, i.e. T_max = 1007 and the time step set T = {0, 1, ..., 1007}.
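As a concrete illustration of the two horizons, the time step sets can be built as follows; the ten-minute step length is an inference from 144 steps per day and is not stated explicitly above.

```python
# Time step sets for the short-term (1 day) and long-term (7 days) problems,
# assuming one time step corresponds to ten minutes (144 steps per day).
STEPS_PER_DAY = 144

T_short = list(range(STEPS_PER_DAY))      # T = {0, 1, ..., 143}, T_max = 143
T_long = list(range(7 * STEPS_PER_DAY))   # T = {0, 1, ..., 1007}, T_max = 1007
```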
On the basis of increasing shared bicycle trips as much as possible, the invention further aims to reduce the number of shared bicycles that sit idle on urban roads for long periods. The invention therefore formulates a dynamic scheduling optimization problem for shared bicycles that includes city warehouses. In this problem it is assumed that fixed warehouses exist in the city and can store surplus bicycles. During a scheduling operation, a dispatching vehicle may move to a unit in which a warehouse is built and place the shared bicycles stored in its cabin into the city warehouse. When the position set ηw of the city fixed warehouses is empty, the shared bicycle scheduling optimization problem reduces to the case without city fixed warehouses, i.e. dispatching vehicles cannot deposit surplus shared bicycles into a city fixed warehouse during scheduling. Conversely, when ηw is not empty, the problem by default assumes that city fixed warehouses exist and that surplus idle shared bicycles can be stored in them.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0, 1}, the dispatching vehicle tag set I = {0, 1, ..., N}, the movement direction variable set of a dispatching vehicle κ1 = {0, 1, ..., 5}, the scheduling ratio variable set κ2 = {0, 0.25, 0.5, 0.75}, and the unit tag set M' = {0, 1, ..., ((M+1)² − 1)}. According to the variable definitions of the shared bicycle short-term scheduling optimization problem, and taking the implementation of the scheduling policy into account, the constraints on the actual unit trip production and the actual unit trip attraction guarantee the conservation of the shared bicycle trip flow of every unit.
In the constructed short-term scheduling optimization problem of the shared bicycle, the objective function maximizes the total shared bicycle trips added in the area traversed by the dispatching vehicles, compared with the case in which no scheduling policy is executed. The decision variables are the action decisions of the dispatching vehicles, namely the movement direction towards a unit and the number of bicycles to be scheduled. The constraints are the conservation of the total number of shared bicycles, the conservation relation between riding path flows and riding OD flows, and non-negativity and integrality constraints on the flows in the scheduling process. When the travel demand generated in a unit exceeds the number of shared bicycles available in that unit, the excess demand is treated as lost demand.
In an embodiment of the present invention, step S4 comprises the sub-steps of:
s41: determining elements of a shared bicycle scheduling framework based on a vehicle scheduling optimization model of the shared bicycle;
s42: determining average actions by using a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle scheduling framework;
s44: based on the average field theory, the shared bicycle scheduling frame is constructed according to the elements, the average actions, the experience pool variables and the training round related variables of the shared bicycle scheduling frame.
Based on the proposed shared bicycle scheduling optimization problem, the invention provides a shared bicycle scheduling framework built on multi-agent reinforcement learning with the average field (mean field) theory. The goal is to enable the agents to learn the changing riding demand, adapt to a stochastic dynamic environment, realize cooperative dynamic decision optimization, and increase riding trips.
In the embodiment of the present invention, in step S41, the present invention combines the shared bicycle transfer process model and the multi-agent reinforcement learning algorithm to construct a vehicle dispatching model for the shared bicycle. The invention defines I as the agent tag set, which is equivalent to the tag set of the dispatching vehicles; S is the state set; A_i represents the action space of agent i; P is the transition probability function; R is the reward function; and γ is the discount factor. The MDP-based reinforcement learning model therefore includes six elements: G = (I, S, A, P, R, γ), where i ∈ I = {0, 1, ..., N} is the tag variable of a dispatching vehicle and, equivalently, of an agent in the reinforcement learning algorithm.
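The six elements can be collected in a small container type, sketched below with placeholder types; the default discount factor value is illustrative only and is not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SchedulingMDP:
    """G = (I, S, A, P, R, gamma) for the shared bicycle dispatching problem (illustrative sketch)."""
    agents: Sequence[int]            # I: dispatching vehicle / agent tags {0, 1, ..., N}
    states: object                   # S: joint supply-and-position state space
    action_spaces: Sequence[object]  # A_i: per-agent action space (direction x ratio)
    transition: Callable             # P: state transition probability function
    reward: Callable                 # R: reward function (PA, APA or APTU variant)
    gamma: float = 0.95              # discount factor (illustrative default value)
```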
The elements of the shared bicycle scheduling framework include the state, the behavior parameter a_t and the reward function, where s_t^i represents the state of dispatching vehicle i at time step variable t and ρ_t^i represents the scheduling policy of dispatching vehicle i at time step variable t.
The reward function includes the actually-increased trip reward function of a dispatching vehicle, the average increased trip reward function of a dispatching vehicle, and the globally increased trip reward function of the dispatching vehicles, whose values are computed from the following quantities: α_rw denotes the scaling coefficient of the reward function; the actual trip amounts of units η_{i,0} and η_{i,1} at time step variable t are taken both when the scheduling policy is implemented and when it is not; the numbers of dispatching vehicles within η_{i,0} and η_{i,1} at time step variable t are used to average the increase; the actual trip amount of unit η5 when the scheduling policy is implemented and when it is not enters the global reward; N denotes the maximum value of the dispatching vehicle tag variable; η5 denotes the global tag of each scheduling area unit; and M' denotes the unit tag set of the scheduling area units.
In the state definition, s_t^i represents the state of dispatching vehicle i at time step variable t; the invention assumes that s_t^i comprises the supply amount of the unit in which agent i is located and the position number of that unit.
The behavior parameter refers to the joint action of the scheduling policies of the dispatching vehicles at time step t and satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, where A is the set of vectors formed by the action spaces A_i of the individual agents. The action policy of agent i is identical to the scheduling policy of the corresponding dispatching vehicle; agent i refers to each dispatching vehicle tag in the city. r_t^i is the instant evaluation, given by the environment, of the state and the generated action during the interaction of agent i with the environment, and the goal of agent i is to maximize its reward. The present invention considers three forms of reward function.
The reward function is the instant evaluation, given by the environment, of the state and the generated action during the interaction of agent i with the environment; the goal of agent i is to maximize the reward value. Based on the shared bicycle scheduling problem, the variables used in the calculation of the reward functions are as follows: α_rw is the dimensionless scaling coefficient of the reward function; the actual trip amounts of units η_{i,0} and η_{i,1} at time step t (dimensionless) are evaluated with and without the scheduling policy; and the numbers of dispatching vehicles within units η_{i,0} and η_{i,1} at time step t (dimensionless) are used for averaging. In the shared bicycle scheduling problem, the invention considers the following three types of selectable reward function.
(1) Increased trip reward obtained by the agent: the invention defines the Increased Trip Production Obtained by Agent (PA) reward function, referred to as the PA reward function. It represents the increase in shared bicycle trips obtained by each agent after that agent performs its action. In the PA reward function, the rewards within the units that an agent moves through are all counted as rewards of that agent; this setting may lead an agent to focus on the scheduling of particular units.
(2) Average increased trip reward obtained by the agent: the invention defines the Average Increased Trip Production Obtained by Agent (APA) reward function, referred to as the APA reward function. It is the average increase in shared bicycle trips obtained by each agent after the agent performs its action, defined as the average trip increase produced in the units η_{i,0} and η_{i,1} traversed by the dispatching vehicle when the scheduling policy is executed.
(3) Globally increased trip reward obtained by the agent: the invention defines the Average Increased Trip Production Obtained by an Agent of Total Units (APTU) reward function, referred to as the APTU reward function. It is the area-wide increase in shared bicycle trips obtained by all agents after the joint action is performed.
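A minimal sketch of the three reward variants, mirroring only their verbal description above (increase in trips in the units the vehicle passes through; the same increase averaged over the dispatching vehicles present in those units; the area-wide increase shared by all agents). All identifiers are illustrative, the tanh squashing anticipates the reward processing described later in the framework, and dividing the APTU gain by the number of agents is an assumption.

```python
import math

def pa_reward(alpha_rw, units_visited, trips_with, trips_without):
    """PA: increased trips obtained by one agent in the units it passes through."""
    gain = sum(trips_with[u] - trips_without[u] for u in units_visited)
    return math.tanh(alpha_rw * gain)   # scaled then squashed, per the framework description

def apa_reward(alpha_rw, units_visited, trips_with, trips_without, vehicles_in_unit):
    """APA: the same increase, averaged over the dispatching vehicles present in each unit."""
    gain = sum((trips_with[u] - trips_without[u]) / max(vehicles_in_unit[u], 1)
               for u in units_visited)
    return math.tanh(alpha_rw * gain)

def aptu_reward(alpha_rw, all_units, trips_with, trips_without, n_agents):
    """APTU: area-wide increase in trips, shared by all agents (equal split assumed)."""
    gain = sum(trips_with[u] - trips_without[u] for u in all_units)
    return math.tanh(alpha_rw * gain / n_agents)
```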
The state transition probability means that, as the time step advances, the state of each agent is updated according to the joint action performed by the agents and their interaction with the environment.
In the embodiment of the present invention, in step S42, the average action is determined as follows: the scheduling policy ρ_t^i of each dispatching vehicle is rewritten in one-hot coded form, and the average action of agent i is the arithmetic mean of the one-hot action vectors of the remaining dispatching vehicles. Here each component of the one-hot vector is a variable equal to 0 or 1, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle tag variable, and i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, whose action policy enters the average.
Concretely, the invention rewrites the movement direction and the scheduling ratio of a dispatching vehicle as a one-hot action vector, and the average action of agent i is the average of the action vectors of the remaining agents.
In a traditional multi-agent deep reinforcement learning algorithm the joint action a_t satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, and its dimension is (N+1)·ρ_dim, which expands as the number of agents increases and leads to greater network complexity, lower computational efficiency and weaker policy optimization in reinforcement learning.
In the MFMARL algorithm, by contrast, the dimension of the joint average action is ρ_dim, the dimension of an individual agent's action is ρ_dim, and the combined (action, average action) input has dimension 2·ρ_dim. Processing the joint action in the mean field (MF) manner therefore keeps the dimension of the joint action under control and preserves computational efficiency. In particular, in complex simulation environments the number of agents is typically large, and using the joint average action based on MF theory alleviates the dimensional explosion of the joint action caused by the growing number of agents.
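A sketch of the one-hot rewriting and the mean-field average action, assuming ρ_dim is the concatenation of a 6-way direction one-hot and a 4-way ratio one-hot (so ρ_dim = 10 under this assumption):

```python
import numpy as np

DIRECTIONS = 6   # kappa_1 = {0, 1, ..., 5}
RATIOS = 4       # kappa_2 = {0, 0.25, 0.5, 0.75}

def one_hot_action(direction_idx, ratio_idx):
    """Rewrite a scheduling action (direction, ratio) as a concatenated one-hot vector."""
    a = np.zeros(DIRECTIONS + RATIOS)
    a[direction_idx] = 1.0
    a[DIRECTIONS + ratio_idx] = 1.0
    return a

def mean_action(actions, agent_i):
    """Mean-field average of the one-hot actions of all agents other than agent_i."""
    others = [a for j, a in enumerate(actions) if j != agent_i]
    return np.mean(others, axis=0)
```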
In an embodiment of the present invention, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round related variables include the number of training rounds Episode, the target network update interval Episode_upnet, the weight coefficient ω used for target network updates, and the cumulative return discount factor γ.
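A minimal sketch of the experience pool and the training-round variables; the numeric values are illustrative defaults, not settings taken from the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool for the scheduling framework (illustrative sketch)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition = (state, action, mean_action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        data = list(self.buffer)
        return random.sample(data, min(batch_size, len(data)))

# Training-round related variables (values are illustrative defaults).
EPISODES = 1000           # Episode: number of training rounds
EPISODES_UPDATE_NET = 10  # Episode_upnet: interval for transferring weights to target networks
OMEGA = 0.01              # weight coefficient for target network updates
GAMMA = 0.95              # cumulative return discount factor
```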
In an embodiment of the present invention, step S44 includes the following sub-steps:
S441: initializing the experience pool; setting the experience pool capacity, the weight coefficient ω for target network updates, the reward function scaling coefficient α_rw and the cumulative return discount factor γ; setting the initial given supply, the shared bicycle travel demand variable and the trip flow ratio of shared bicycles departing from η2 and arriving at η3; and performing steps S442-S445 in a loop according to the number of training rounds Episode and the update interval Episode_upnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state s_t^i and the scheduling policy ρ_t^i of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state of each dispatching vehicle, the average actions, and the increased benefit of each dispatching vehicle after the scheduling policy is implemented, according to the reward function;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm and completing the shared bicycle scheduling with it, as summarized in the sketch below.
Improving the stability of the multi-agent environment: during training in a multi-agent environment, the policy of every agent changes continuously. From the point of view of agent i, the policies of the other agents keep changing, so agent i finds itself in a non-stationary environment, and the transition probability from the state at one time to the state at the next time cannot be guaranteed to be a stable value. That is, for any pair of policies of agent i at times t1 and t2 with π_{t1}^i ≠ π_{t2}^i, the non-stationary environment may satisfy
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) ≠ P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N),
where π_{t1}^i and π_{t2}^i denote the policies of agent i at times t1 and t2, and the two sides denote the state transition probabilities of agent i at times t1 and t2, respectively.
If agent i knows the action content of all agents during reinforcement learning, the environment of agent i becomes stationary. Since a policy can be expressed through actions and states, when the states and actions of all agents are known, the state transition probabilities of agent i at times t1 and t2 respectively satisfy
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N),
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N).
Both transition probabilities can therefore be regarded as policy-independent. Even while the agents' policies keep changing, the transition probability from the state at one time to the state at the next time remains stationary, i.e.
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N).
Thus, for any given joint (centralized) action, the environment of agent i can be improved into a stationary environment:
P(s_{t+1} | s_t, a_t^1, ..., a_t^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N).
In the embodiment of the present invention, in step S442 the shared bicycle operating environment under policy execution state variable tr = 0 is updated by recomputing, for each OD tag variable (η2, η3) of shared bicycle travel, the shared bicycle path flow, the actual shared bicycle trip production variable of unit η5, the actual shared bicycle trip attraction variable of unit η5, and the supply amount of each unit when tr = 0. In step S444, the shared bicycle operating environment under policy execution state variable tr = 1 is updated by recomputing, from the behavior parameter a_t, the number of shared bicycles that each dispatching vehicle picks up from η_{i,0} and places when it reaches η_{i,1}, and the supply amount of each unit when tr = 1.
In step S446, the reinforcement learning algorithm adopts either a policy gradient method or a Q-Learning method.
Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state s_t^i of the dispatching vehicle and whose output is the scheduling policy ρ_t^i; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step and whose output is the scheduling policy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatching vehicle, its scheduling policy and the average action, and whose output is the Q value function Q_i, where the Q value function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a neural network whose inputs are the state, scheduling policy and average action of the next time step, and whose output is the target Q value function. The estimation network and the target network of the value model run in forward-propagation mode when computing Q_i and the target Q value, and in back-propagation mode when updating parameters; the policy model is computed in the same way.
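A minimal PyTorch sketch of one dispatching vehicle's policy (estimation) network and value (estimation) network under the policy gradient variant; the layer sizes and activations are assumptions, and the target networks would be structurally identical copies.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the dispatching vehicle's state to a distribution over scheduling actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Maps (state, own action, mean action) to the Q value used by the mean-field critic."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, mean_action):
        x = torch.cat([state, action, mean_action], dim=-1)
        return self.net(x)
```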
In the Q-Learning method, the policy model of a dispatching vehicle samples its action probabilistically from a Boltzmann distribution over the Q values evaluated with the average action of the previous time step: the probability of each candidate action in the action space A_i is computed from Q_i with the policy parameter ω_d, and a provisional action is drawn; the average action is then updated with the newly selected actions, substituted back into the action probability calculation, and a second probabilistic sampling yields the action that is finally chosen by the policy model of the dispatching vehicle. Here i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, and the average action of the previous time step is the mean of the other agents' actions at time step t−1.
For the Q-Learning-based reinforcement learning algorithm, the policy model of each agent derives its action directly from the Q_i value of agent i and contains no policy target model. The value model of each agent is split into a value estimation network and a value target network with the same structure as the value estimation network. The value estimation network takes as input the global state s_t, the action of the previous time step and the average action of the previous time step, and outputs the Q_i value. The input layer of the value target network takes the global state s_{t+1} of the next time step, the action value and the average action value, and its output is the target Q value. The estimation network and the target network of the value model run in forward-propagation mode when computing Q_i and the target Q value, and in back-propagation mode when updating parameters.
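A sketch of the Boltzmann (probability) sampling used by the Q-Learning policy model, with the two-pass selection simplified so that the refreshed mean action is stood in by the provisionally selected action; `q_fn`, `omega_d` and the other names are illustrative, not the patent's symbols.

```python
import numpy as np

def boltzmann_sample(q_values, omega_d, rng=None):
    """Sample an action index with probability proportional to exp(omega_d * Q)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = omega_d * np.asarray(q_values, dtype=float)
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def mf_q_action(q_fn, state, prev_mean_action, omega_d, action_space):
    """Two-pass mean-field Q action selection: sample, refresh the mean action, sample again."""
    q1 = [q_fn(state, a, prev_mean_action) for a in action_space]
    first = boltzmann_sample(q1, omega_d)
    new_mean_action = action_space[first]   # simplified stand-in for the updated mean action
    q2 = [q_fn(state, a, new_mean_action) for a in action_space]
    return action_space[boltzmann_sample(q2, omega_d)]
```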
In the framework, each agent independently contains a policy model used for decentralized action execution and a value model that contains centralized state-action information. The reinforcement learning algorithm can be of two types: the policy gradient method and the Q-Learning method.
If the reinforcement learning algorithm adopts the policy gradient method, the transition consisting of the global state, the joint action, the average action, the reward of the dispatching vehicle and the global state of the next time step is stored into the experience pool, and a batch of samples is randomly drawn from the experience pool. According to the sampled batch, the neural network parameters of the value estimation network are updated from the cumulative return discount factor γ and the loss function, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, the parameters θ_i of the policy model neural network and the parameters of the value model neural network are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and value target network, respectively. In the stored transitions, s_t denotes the global state and s_{t+1} the global state of the next time step; in the sampled batch, s_{t,j} denotes the global state of sample j, s_{t+1,j} the next-time-step global state of sample j, and the sampled policies, average actions and dispatching vehicle reward values are defined analogously.
If the reinforcement learning algorithm adopts the Q-Learning method, the transition consisting of the global state, the action, the average action, the reward and the global state of the next time step is stored into the experience pool, and a batch of samples is randomly drawn from the experience pool. On the sampled batch, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the target network update weight coefficient ω.
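A sketch of the periodic target-network update; the text above only says the parameters are transferred "according to the weight coefficient ω", so the ω-weighted (soft) blend below, applied every Episode_upnet rounds, is an assumption about the exact form.

```python
import torch

@torch.no_grad()
def update_target_network(online_net, target_net, omega):
    """Blend online parameters into the target network with weight omega (soft update)."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - omega).add_(omega * p_online)

# Typical usage inside the training loop, every Episode_upnet rounds:
#   if episode % EPISODES_UPDATE_NET == 0:
#       update_target_network(value_net, value_target_net, OMEGA)
```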
According to the embodiment of the invention, a shared bicycle scheduling framework based on multi-agent reinforcement learning with the average field theory is provided for the proposed shared bicycle scheduling optimization problem. The goal is to enable the agents to learn the changing riding demand, adapt to a stochastic dynamic environment, realize cooperative dynamic decision optimization and increase riding trips. The basic ideas behind the construction of the framework are as follows:
(1) Feasibility of solving shared bicycle scheduling problem by using reinforcement learning algorithm framework
In the constructed shared bicycle scheduling optimization problem, only the supply state at the current time needs to be known; no historical information from earlier times is required. That is, the supply state at the current time depends only on the supply state at the previous time and the policy action executed, and is independent of the supply states and decision actions at other times. The supply state of the shared bicycle scheduling optimization problem therefore possesses the Markov property: the problem satisfies the Markov assumption, i.e. the current state contains all relevant information.
Thus the shared bicycle scheduling problem can be converted into a Markov decision process. A Markov decision process can be solved within a reinforcement learning framework, so solving the shared bicycle scheduling optimization problem with a reinforcement learning framework is feasible. Reinforcement learning needs no labeled data and can learn, in a model-free way, the high-dimensional mapping from states to actions. In view of these advantages, the invention proposes a shared bicycle scheduling optimization framework based on a reinforcement learning algorithm.
(2) Problems arising when solving the shared bicycle scheduling problem with a reinforcement learning framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, issues such as a degraded learning effect and repeated scheduling of the same area by several dispatching vehicles can occur, as discussed below.
First, when a multi-agent setting is imposed on a traditional reinforcement learning algorithm, the policies of the other agents keep changing from the viewpoint of each agent, which results in a non-stationary environment. A non-stationary environment violates the stationarity of Markov state transitions, causes policy estimation errors, and reduces policy optimization efficiency or makes policy optimization fail.
In the DQN algorithm, agent i learns and selects its best policy through an independent Q-learning procedure. In a multi-agent environment, however, agent i updates its policy independently during learning without considering the policies of the other agents, so the environment to which any agent i belongs is non-stationary: at different times t the state transition probability is not necessarily a stable value. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to be stationary, and the non-stationary environment contradicts this assumption. Second, in the traditional DQN algorithm, agent i randomly samples data from the experience pool and the sampled batch is used as training data for the neural network; this avoids the problem of state correlation in the raw data. In a multi-agent environment, however, a policy that agent i optimizes under the current state may be invalid in the next state of the non-stationary environment. Learning from samples that contain invalid policies is inefficient and increases the likelihood that experience replay fails. Therefore, in a multi-agent environment, the traditional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a policy gradient algorithm operating in a non-stationary environment, the variance produced by the algorithm grows as the number of agents increases, and the probability that the policy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected by the environment and other agents. During learning, the dimension of the joint action of the state-action value function will exponentially expand as the number of agents increases. Each agent estimates its own value function according to the joint strategy, and when the joint action space is large, the learning efficiency and learning effect will be reduced.
Third, if the estimation network of the reinforcement learning algorithm has low sensitivity to the input data and the agent is dominated by a particularly high reward or penalty value, the agent over-weights the parameters corresponding to that state and action. This further reduces the agent's sensitivity to changes in state and action information and leads it to select the same action repeatedly. For a single agent, an overly uniform action selection increases the monotony of the learning samples and degrades the fitting accuracy of the neural network in the estimation network. An inaccurate estimate of the dispatching vehicle policy by the reinforcement learning algorithm then lowers scheduling efficiency.
Fourth, in a multi-agent reinforcement learning algorithm, the joint action may send several dispatching vehicles to move shared bicycles within the same unit. Such non-cooperative scheduling policies reduce scheduling efficiency and can even over-schedule certain units, leaving no bicycles available for rent or piling up an excess of shared bicycles.
(3) Basic idea of frame construction
In view of these problems, the shared bicycle scheduling optimization framework based on multi-agent deep reinforcement learning must, in a stationary environment, address the dimensional explosion caused by the number of agents and achieve efficient cooperation among the agents. The framework is constructed along the following lines:
First, the agent learning structure of the framework:
In multi-agent deep reinforcement learning methods with a distributed structure, the algorithms can be classified into group reinforcement learning and independent reinforcement learning according to whether an agent considers the state and behavior information of the other agents.
If each agent in the multi-agent system is treated as an independent single agent without communication capability, i.e. it ignores the policy choices of the other agents when selecting its own policy, the algorithm is of the independent learning type. In this case, shared information can only be obtained among the agents through collective communication after feedback from the external environment. Conversely, in group learning algorithms, the multiple agents are treated as a combined group and each agent also considers the policy choices of the other agents during learning.
Independent learning avoids the dimensional explosion in communication caused by a growing number of agents and can reuse reinforcement learning algorithms designed for static environments, but it converges slowly and requires long learning times. Group learning allows the agents to communicate fully and achieve complete cooperation, but its search space is large and learning also takes long. To realize cooperation and communication among the agents, the policies of the other agents are taken into account during learning, and a group reinforcement learning method is adopted to construct the multi-agent deep reinforcement learning algorithm.
Second, improving the non-stationary environment in the framework:
If agent i learns the action content of all agents during learning, the environment of agent i becomes stationary. The states and action contents of the agents are therefore treated as known information here in order to improve the non-stationary environment.
Third, mitigating the dimensional explosion caused by the growing number of agents in the framework:
Mean field game theory (Mean Field Theory, MFT) studies differential games within populations of rational players. Each agent considers its own state while still taking the states of the other agents into account. A classic illustration of mean field games is a school of fish swimming cooperatively: an individual fish does not track the swimming behavior of every other fish, but adjusts its own behavior according to the behavior of the school in its neighborhood. Mean field game theory describes the behavioral response of surrounding agents and the behavior set of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The Mean Field Multi-Agent Reinforcement Learning (MFMARL) algorithm, built on mean field game theory, assumes that the influence of all other agents on a given agent can be represented by an average distribution. MFMARL is suited to reinforcement learning with large numbers of agents, simplifies the interaction computations among them, and resolves the expansion of the value function space caused by the growing number of agents. MFMARL is therefore incorporated into the shared bicycle scheduling framework here, and every agent is defined to have the same discrete action space.
Fourth, improving the sensitivity of the reinforcement learning algorithm to changes in state and action information:
To improve learning stability in the multi-agent deep reinforcement learning framework and increase the sensitivity of the neural network to changes in state and action information, the framework feeds one-hot coded inputs to the neural network, and the reward function value is first scaled and then passed through the hyperbolic tangent function tanh(·).
Fifth, improving efficient cooperation among the agents:
To improve efficient cooperation among the agents, the framework makes the action content of all agents known during the learning process of agent i, so that the states and policies of the other agents can be learned. In addition, different forms of reward function are discussed and their impact on cooperation is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a shared bicycle scheduling framework of multi-agent reinforcement learning based on the average field theory with multi-agent group learning.
The working principle and the working process of the invention are as follows: the invention aims to establish a general framework for shared bicycle scheduling based on multi-agent deep reinforcement learning of average field theory, so as to solve the shared bicycle scheduling problems of long-term scheduling process, dynamic environment and large-scale network. The stability of state transition of the multi-agent deep reinforcement learning algorithm, dimensional explosion, agent communication efficiency and agent exploration behaviors are considered. And a framework of a reinforcement learning algorithm is adopted, and a coordinated and effective dynamic strategy is obtained in a high-dimensional action space, so that travel requirements are met, and idle shared bicycles in a road are reduced. And combining reinforcement learning basic theory and shared bicycle scheduling system research, defining the division of area units, and constructing a shared bicycle scheduling optimization model.
For the high-dimensional multi-agent action space, a shared bicycle scheduling framework based on multi-agent deep reinforcement learning with the average field theory is proposed. The framework can address long-term scheduling, dynamic environments, and large-scale, complex networks. It requires neither advance demand prediction nor manual data processing, and is therefore unaffected by the computational efficiency and accuracy of demand forecasting. Moreover, the framework does not seek the best strategy for each individual time period; it is an overall optimization of the entire scheduling process that takes into account the supply and demand changes of future periods and the influence of scheduling decisions on the supply and demand of the next period.
The beneficial effects of the invention are as follows:
(1) The shared bicycle scheduling optimization method based on reinforcement learning is beneficial to intelligently solving the problem of short-term and long-term scheduling optimization of shared bicycles of a large-scale road network under random and complex dynamic environments. The method does not need to predict the demand in advance or conduct manual data processing, and is not affected by the calculation efficiency and accuracy of demand prediction. And this method is not the optimal strategy for each time period, but is an overall optimization method for the entire scheduling process, which takes into account the supply and demand changes of the future time period and the influence of the scheduling decision on the supply and demand of the next time period.
(2) The dynamic optimization scheduling strategy provided by the invention can improve scheduling operation efficiency. It increases the trip volume and utilization rate of shared bicycles and reduces the lost demand of shared bicycle users. It lowers the idle rate of shared bicycles on the road and reduces the number of idle bicycles piled up excessively in certain areas, thereby reducing the waste of shared resources and alleviating the degradation of the urban environment caused by large numbers of stacked idle bicycles.
(3) Increasing the actual trips of shared bicycle users raises the share of shared bicycles in connecting (feeder) trips and improves the operating efficiency of the public transport system. Better service quality encourages shared bicycles to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (6)

1. The shared bicycle scheduling method based on deep reinforcement learning is characterized by comprising the following steps of:
s1: dividing a dispatching area of the shared bicycle to obtain a dispatching area unit, and determining a running environment variable of the shared bicycle;
s2: determining a scheduling variable of the shared bicycle according to the running environment variable of the shared bicycle based on the scheduling area unit;
s3: constructing a vehicle dispatching optimization model of the shared bicycle according to dispatching variables of the shared bicycle;
s4: based on a vehicle dispatching optimization model of the shared bicycle, constructing a shared bicycle dispatching frame by utilizing an average field theory, and completing shared bicycle dispatching by utilizing the shared bicycle dispatching frame;
in the step S1, the scheduling area of the shared bicycle is divided as follows: the scheduling area is divided into a plurality of identical regular hexagons serving as scheduling area units, and for each scheduling area unit a global tag variable η5, a horizontal direction tag variable m and a vertical direction tag variable h are defined, which satisfy a one-to-one indexing relation between η5 and the pair (m, h),
wherein η5 ∈ M', M' = {0, 1, ..., ((M+1)² − 1)}, M represents the maximum value of the horizontal direction tag variable or the vertical direction tag variable of the scheduling area units, and M' represents the unit tag set of the scheduling area units;
In the step S1, the running environment variables of the shared bicycle comprise a time variable and a city fixed warehouse position set variable;
the time variable comprises a time step variable t, a time step variable set T and a time step maximum value variable T_max, wherein t ∈ T, T = {0, 1, ..., T_max};
the city fixed warehouse position set variable comprises the fixed warehouse position set ηw;
In the step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling policy variable class;
the policy execution state variable class comprises the policy execution state variable tr, wherein tr ∈ {0, 1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of each scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of each scheduling area unit when the policy execution state variable tr = 1;
at time step t, the riding trip variable class comprises the global tag η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global tag η3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η2, η3) of shared bicycle travel, the shared bicycle travel OD flow, the trip flow ratio of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip production variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
At time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable of a dispatching vehicle, the arrival unit tag variable of a dispatching vehicle, the movement direction variable set κ1 of a dispatching vehicle, the scheduling ratio variable set κ2, the movement direction variable of a dispatching vehicle from its starting unit to one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling policy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable for the number of shared bicycles that a dispatching vehicle picks up from η_{i,0} and places at its arrival unit, the ratio α_wh of the number of shared bicycles placed by the dispatching vehicle into η_{i,1} to the number of bicycles in its cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to ηw, the estimated cumulative increase or decrease variable of the supply of unit η5 under the expectation that the preceding dispatching vehicles have already applied their scheduling policies before the current dispatching vehicle applies its own, the increased benefit of a dispatching vehicle after it implements the scheduling policy, and the total amount Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
wherein I = {0, 1, ..., N}, N represents the maximum value of the dispatching vehicle tag variable, i ∈ I, κ1 = {0, 1, ..., 5}, and κ2 = {0, 0.25, 0.5, 0.75};
In the step S3, the vehicle scheduling optimization model of the shared bicycle specifically includes:
Figure QLYQS_23
s.t.
Figure QLYQS_24
Figure QLYQS_25
Figure QLYQS_26
Figure QLYQS_27
Figure QLYQS_28
Figure QLYQS_29
Figure QLYQS_30
Figure QLYQS_31
Figure QLYQS_32
Figure QLYQS_33
/>
Figure QLYQS_34
Figure QLYQS_35
Figure QLYQS_36
Figure QLYQS_37
in the vehicle scheduling optimization model, the benefit increased after the dispatching vehicles implement the scheduling policy is maximized as the objective function of the shared bicycle short-term scheduling optimization problem; the objective accumulates the increased benefit of every dispatching vehicle over all time steps of the scheduling cycle, wherein t represents a time step, T_max the maximum value variable of the time step, i the dispatching vehicle tag variable and N the maximum value of the dispatching vehicle tag variable;
the action decision of a dispatching vehicle when the policy execution state variable tr = 0 at time step variable t consists of its movement direction variable from η_{i,0} to one of the six adjacent regular hexagons and its scheduling ratio variable;
when the policy execution state variable tr = 0 at time step variable t and the global tag variable η5 of a scheduling area unit coincides with the global tag η2 of the unit where the OD origin of a shared bicycle trip is located, the shared bicycle path flow of the OD tag variable (η2, η3) is computed, using the round-down operation INT(·), from the shared bicycle travel demand variable of the unit, its supply (the initially given supply when t = 0, or the supply variable with tr = 1 otherwise) and the trip flow ratio of shared bicycles departing from η2 and arriving at η3, wherein M' represents the unit tag set of the scheduling area units;
the trip flow ratios taking the global tag η2 of the OD-origin unit as the starting point sum to 1 over all destination units, wherein T represents the time step variable set and η3 represents the global tag of the unit where the OD destination of the shared bicycle trip is located;
when the policy execution state variable tr = 0 at time step t and the global tag variable η5 of a scheduling area unit equals the global tag η2 of the OD-origin unit, the sum of the corresponding shared bicycle path flows is taken as the actual shared bicycle trip production of unit η5; when η5 equals the global tag η3 of the OD-destination unit, the sum of the corresponding shared bicycle path flows is taken as the actual shared bicycle trip attraction of unit η5;
when the policy execution state variable tr = 0 at time step variable t, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' travel activities: the supply of unit η5 equals its supply after the scheduling policy was applied at time step (t−1) (i.e. the supply variable with tr = 1), minus the actual shared bicycle trip production of η5 at time step t, plus the actual shared bicycle trip attraction of η5 at time step t;
when the policy execution state variable tr = 0 at time step variable t, the unit tag variable that a dispatching vehicle will reach at time step (t+1) is computed from the horizontal direction tag variable m and the vertical direction tag variable h of its current unit together with its movement direction variable from η_{i,0} to one of the six adjacent regular hexagons, and the starting unit tag variable of the dispatching vehicle at time step (t+1) is obtained accordingly;
when the policy execution state variable tr = 0 at time step variable t, the estimated cumulative increase or decrease of the supply of unit η5 accumulates the numbers of shared bicycles that the preceding dispatching vehicles are predicted to pick up from or place into η5, wherein α_wh denotes the ratio of the number of shared bicycles placed by a dispatching vehicle into η_{i,1} to the number of bicycles in its cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to the fixed warehouse position set ηw;
when the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up from η_{i,0} a number of shared bicycles determined by taking the minimum (min(·)) of the quantity given by its scheduling ratio variable applied to the supply of η_{i,0} when tr = 0 and the remaining space of its cabin, bounded by the maximum cabin capacity of the dispatching vehicle, places the picked-up bicycles into its cabin, and later places them entirely into η_{i,1}, wherein η_{i,0} represents the starting unit tag variable of the dispatching vehicle;
when the policy execution state variable tr = 1 at time step t, the scheduling policy is executed according to the number of shared bicycles picked up by each dispatching vehicle, and the shared bicycle supply variable of unit η5 after the scheduling policy has been implemented is updated accordingly;
the total amount Z_warehouse of shared bicycles stored in the city fixed warehouses is obtained by accumulating, over the scheduling cycle, the shared bicycles that the dispatching vehicles place into units belonging to ηw and that are stored into the warehouses;
Said step S4 comprises the sub-steps of:
s41: determining elements of a shared bicycle scheduling framework based on a vehicle scheduling optimization model of the shared bicycle;
s42: determining average actions by using a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle scheduling framework;
s44: based on the average field theory, the shared bicycle scheduling frame is constructed according to the elements, the average actions, the experience pool variables and the training round related variables of the shared bicycle scheduling frame.
2. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S41 the elements of the shared bicycle scheduling framework include the state of the dispatching vehicle, the behavior parameter a_t and the reward functions, wherein the state describes dispatching vehicle i at the time step variable t and the behavior parameter gives the scheduling policy of the dispatching vehicle at the time step variable t;
the reward functions include the actually increased trip amount reward function of the dispatching vehicle, the average increased trip amount reward function of the dispatching vehicle and the globally increased trip amount reward function of the dispatching vehicle, wherein α_rw represents the scaling coefficient of the reward function; each reward compares the trip amount generated by a scheduling region unit when the scheduling policy is implemented with the trip amount generated when it is not implemented, the average reward further accounts for the number of dispatching vehicles within the corresponding scheduling region unit at the time step variable t, and the global reward aggregates the trip amounts of η_5 with and without the scheduling policy over all units; N represents the maximum value of the dispatching vehicle tag variable, η_5 represents the global tag of each scheduling region unit, and M' represents the set of unit tags of the scheduling region units.
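A hedged Python sketch of the three reward signals described above. It assumes each reward is the α_rw-scaled difference between the trip amount generated with and without the scheduling policy, with the average reward shared among the vehicles in the same unit and the global reward aggregated over all units in M'; the exact formulas are not reproduced in the source text, so the function and argument names are illustrative.

```python
def actual_reward(alpha_rw, trips_with_policy, trips_without_policy):
    """Actually increased trip amount for the unit served by dispatching vehicle i."""
    return alpha_rw * (trips_with_policy - trips_without_policy)

def average_reward(alpha_rw, trips_with_policy, trips_without_policy, n_vehicles_in_unit):
    """Average increase: the actual increase shared among the dispatching
    vehicles located in the same scheduling region unit at time step t."""
    return actual_reward(alpha_rw, trips_with_policy, trips_without_policy) / max(n_vehicles_in_unit, 1)

def global_reward(alpha_rw, trips_with, trips_without, n_vehicles_total):
    """Global increase: summed over every scheduling region unit in M' and
    normalised by the total number of dispatching vehicles."""
    total = sum(alpha_rw * (w - wo) for w, wo in zip(trips_with, trips_without))
    return total / max(n_vehicles_total, 1)
```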
3. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S42 the specific method for determining the average action is: rewriting the scheduling policy of the dispatching vehicle in a one-hot coding mode, i.e. as a vector of 0/1 variables whose dimension is the scheduling policy dimension ρ_dim, and obtaining the average action of dispatching vehicle i by averaging the one-hot coded action policies of the dispatching vehicles i_ne, wherein N represents the maximum value of the dispatching vehicle tag variable, and i_ne represents the tag variable of a dispatching vehicle different from dispatching vehicle i.
4. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein the experience pool variables of the shared bicycle scheduling framework in step S43 include the experience pool and the experience pool capacity; the training round related variables include the training round number Episode, the updating training round number Episode_upnet, the weight coefficient ω updated by the target network and the accumulated return discount factor γ.
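A small Python sketch of the experience pool with a fixed capacity. The stored tuple follows the policy gradient branch of claim 6 (global state, scheduling policies, average actions, rewards, next global state); the class and method names are illustrative.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity replay buffer; the oldest transitions are discarded first."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, policies, mean_actions, rewards, next_state):
        self.buffer.append((state, policies, mean_actions, rewards, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```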
5. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein the step S44 comprises the following sub-steps:
S441: initializing the experience pool, setting the experience pool capacity, the weight coefficient ω updated by the target network, the reward function scaling coefficient α_rw, the accumulated return discount factor γ, the initially given supply amount, the shared bicycle travel demand variable and the trip flow ratio of shared bicycles departing from η_2 and arriving at η_3, and performing steps S442-S445 in a loop based on the training round number Episode;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling policy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state of each dispatching vehicle and the average action, and updating the increased benefit of each dispatching vehicle after implementing the scheduling policy according to the reward functions;
S446: based on the updating process of steps S442-S445, constructing the shared bicycle scheduling framework by using a reinforcement learning algorithm, and completing the shared bicycle scheduling by using the shared bicycle scheduling framework.
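A control-flow skeleton of the S441-S446 loop in Python. The env, agents and pool objects stand in for the environment update, the dispatching vehicle agents and the experience pool of claims 4-6; this is a sketch of the loop structure implied by the sub-steps, not the claimed implementation.

```python
def train(env, agents, pool, episodes, episode_upnet, gamma, omega):
    """Outline of steps S441-S446 under the stated placeholder assumptions."""
    for episode in range(episodes):
        env.reset()                                                   # S441: supply, demand, flow ratios
        for t in range(env.horizon):
            env.update(tr=0)                                          # S442
            states = [ag.observe(env) for ag in agents]
            policies = [ag.act(s) for ag, s in zip(agents, states)]   # S443
            env.apply(policies)                                       # execute the scheduling policies
            env.update(tr=1)                                          # S444
            next_states = [ag.observe(env) for ag in agents]
            mean_acts = [ag.mean_action(policies) for ag in agents]   # S445
            rewards = env.rewards(policies)
            pool.store(states, policies, mean_acts, rewards, next_states)
        for ag in agents:                                             # S446: RL update
            ag.learn(pool, gamma)
        if episode % episode_upnet == 0:
            for ag in agents:
                ag.sync_targets(omega)                                # weighted transfer to target networks
```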
6. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S442 the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: updating and calculating the shared bicycle path flow of the travel OD tag variable (η_2, η_3), the actual trip generation variable of the shared bicycles of η_5, the actual trip attraction variable of the shared bicycles of η_5 and the supply amount when the policy execution state variable tr = 0; in step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: updating and calculating, from the behavior parameter a_t, the variable for the number of shared bicycles that the dispatching vehicle picks up from η_{i,0} and puts down upon arriving at η_{i,1}, and the supply amount when the policy execution state variable tr = 1;
in step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
each dispatching vehicle comprises a policy model and a value model; in the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network; the policy estimation network is constructed as a neural network whose parameters are θ_i, whose input is the state of the dispatching vehicle and whose output is the scheduling policy; the policy target network is constructed as a neural network whose input is the state of the dispatching vehicle at the next time step variable and whose output is the scheduling policy of the next time step variable;
in the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network; the value estimation network is constructed as a neural network whose inputs are the state of the dispatching vehicle, the scheduling policy and the average action and whose output is the Q value function Q_i, wherein the Q value function refers to the state-action value function in the reinforcement learning algorithm and represents the accumulated reward value attained by the dispatching vehicle; the value target network is constructed as a neural network whose inputs are the state of the dispatching vehicle at the next time step variable, the scheduling policy of the next time step variable and the average action of the next time step variable and whose output is the target Q value function;
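A hedged PyTorch sketch of the two estimation networks described above: a policy network mapping the vehicle state to a scheduling policy, and a value network mapping (state, policy, average action) to Q_i. Layer sizes and activations are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy estimation network (parameters theta_i): state -> scheduling policy."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, rho_dim), nn.Softmax(dim=-1))

    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Value estimation network: (state, policy, average action) -> Q_i."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * rho_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, policy, mean_action):
        return self.net(torch.cat([state, policy, mean_action], dim=-1))
```

The corresponding target networks would be identically shaped copies, for example initialised with policy_target.load_state_dict(policy_net.state_dict()).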
in the Q-Learning method, in the policy model of the dispatching vehicle, an action is obtained by probability sampling and selection according to an action probability calculation function derived from the functional expression form of Q_i, wherein the average action of the other dispatching vehicles i_ne at time step t-1 is used as input, i_ne represents the tag variable of a dispatching vehicle different from dispatching vehicle i, ω_d represents the policy parameter, and A_i represents the action space set of dispatching vehicle i; the average action is then updated according to the sampled actions and substituted back into the action probability calculation function, probability sampling is performed again, and the action obtained in this second sampling is taken as the final choice of the policy model of the dispatching vehicle;
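A numpy sketch of this two-pass selection: provisional sampling with the t-1 average action, a refresh of each vehicle's average action from the provisional policies of the others, then resampling for the final choice. The Boltzmann (softmax) form of the action probability function, with ω_d acting as a temperature-like policy parameter, is an assumption; the claimed formulas are not reproduced in the source text.

```python
import numpy as np

def boltzmann_probs(q_values: np.ndarray, omega_d: float) -> np.ndarray:
    """Assumed action probability calculation function: softmax of Q over A_i."""
    z = q_values / omega_d
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_policy(q_fn, state, mean_act, actions, omega_d, rng):
    """One probability-sampling pass over the action space A_i."""
    q = np.array([q_fn(state, a, mean_act) for a in actions])
    return int(rng.choice(actions, p=boltzmann_probs(q, omega_d)))

def select_actions(q_fns, states, prev_means, actions, omega_d, rho_dim, rng):
    """Two-pass mean-field selection: provisional sampling, average-action refresh, resampling."""
    provisional = [sample_policy(q, s, prev_means[i], actions, omega_d, rng)
                   for i, (q, s) in enumerate(zip(q_fns, states))]
    final = []
    for i, (q, s) in enumerate(zip(q_fns, states)):
        mean_i = np.mean([np.eye(rho_dim)[a] for k, a in enumerate(provisional) if k != i], axis=0)
        final.append(sample_policy(q, s, mean_i, actions, omega_d, rng))
    return final
```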
if the reinforcement learning algorithm adopts the policy gradient method, the tuple consisting of the global state s_t, the scheduling policies, the average actions, the reward values r_t^i of the dispatching vehicles and the global state s_{t+1} of the next time step is stored in the experience pool, a batch of samples is randomly sampled from the experience pool, the neural network parameters of the value estimation network are updated according to the sampled samples, the accumulated return discount factor γ and a loss function, and the neural network parameters θ_i of the policy model are updated by the gradient descent method; every Episode_upnet training rounds, the parameters θ_i of the policy model neural network and the parameters of the value model neural network are transferred, according to the weight coefficient ω updated by the target network, to the neural network parameters of the corresponding policy target network and the neural network parameters of the value target network respectively; wherein s_t represents the global state, s_{t+1} represents the global state of the next time step, r_t^i represents the reward value of the dispatching vehicle, s_{t,j} represents the global state in a sampled sample, s_{t+1,j} represents the global state of the next time step in the sampled sample, and the sampled sample likewise contains the scheduling policies, the average actions and the reward values of the dispatching vehicles;
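A hedged PyTorch sketch of one policy-gradient update for a single dispatching vehicle: a TD target built from the value target network and γ, a mean squared error loss for the value estimation network, a gradient-descent step on θ_i via the negative Q objective, and the ω-weighted transfer to the target networks every Episode_upnet rounds. The soft-update form target = ω * online + (1 - ω) * target is one assumed reading of the weight coefficient ω; the batch layout and names are illustrative.

```python
import torch

def update_agent(policy, policy_tgt, value, value_tgt, batch, gamma, pi_opt, q_opt):
    """One update from a sampled batch (s, rho, mean_rho, r, s_next, mean_rho_next)."""
    s, rho, mean_rho, r, s_next, mean_rho_next = batch
    with torch.no_grad():
        rho_next = policy_tgt(s_next)
        y = r + gamma * value_tgt(s_next, rho_next, mean_rho_next)   # TD target
    q_loss = torch.mean((value(s, rho, mean_rho) - y) ** 2)          # critic loss
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    pi_loss = -value(s, policy(s), mean_rho).mean()                  # actor objective (gradient descent on theta_i)
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

def sync_targets(online, target, omega):
    """Every Episode_upnet rounds: omega-weighted transfer of parameters to the target network."""
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.copy_(omega * p + (1.0 - omega) * p_t)
```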
if the reinforcement learning algorithm adopts the Q-Learning method, the corresponding tuple is stored in the experience pool, a batch of samples is randomly sampled from the experience pool, the neural network parameters of the value estimation network are updated according to the sampled samples, the accumulated return discount factor γ and a loss function, and every Episode_upnet training rounds the neural network parameters of the value model are transferred, according to the weight coefficient ω updated by the target network, to the neural network parameters of the value target network.
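For the Q-Learning branch, a matching PyTorch sketch of the value-only update: the target uses the value target network, γ and the next-step average action, and the Episode_upnet-interval transfer can reuse the ω-weighted sync_targets above. Taking the maximum over enumerated candidate next actions is an assumption in the spirit of Q-Learning; cand_next and the other names are illustrative.

```python
import torch

def update_q(value, value_tgt, batch, gamma, q_opt):
    """Mean-field Q update: regress Q_i(s, rho, mean_rho) onto r + gamma * max Q_target."""
    s, rho, mean_rho, r, s_next, cand_next, mean_rho_next = batch
    with torch.no_grad():
        # cand_next enumerates one-hot candidate scheduling policies for the next step
        q_next = torch.stack([value_tgt(s_next, a, mean_rho_next) for a in cand_next], dim=0)
        y = r + gamma * q_next.max(dim=0).values
    loss = torch.mean((value(s, rho, mean_rho) - y) ** 2)
    q_opt.zero_grad()
    loss.backward()
    q_opt.step()
    return loss.item()
```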
CN202110744265.2A 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning Active CN113326993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110421814 2021-04-20
CN2021104218142 2021-04-20

Publications (2)

Publication Number Publication Date
CN113326993A CN113326993A (en) 2021-08-31
CN113326993B true CN113326993B (en) 2023-06-09

Family

ID=77425362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110744265.2A Active CN113326993B (en) 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113326993B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113997926A (en) * 2021-11-30 2022-02-01 江苏浩峰汽车附件有限公司 Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
CN115796399B (en) * 2023-02-06 2023-04-25 佰聆数据股份有限公司 Intelligent scheduling method, device, equipment and storage medium based on electric power supplies
CN116307251B (en) * 2023-04-12 2023-09-19 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116402323B (en) * 2023-06-09 2023-09-01 华东交通大学 Taxi scheduling method
CN116824861B (en) * 2023-08-24 2023-12-05 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN112068515A (en) * 2020-08-27 2020-12-11 宁波工程学院 Full-automatic parking lot scheduling method based on deep reinforcement learning
CN112417753A (en) * 2020-11-04 2021-02-26 中国科学技术大学 Urban public transport resource joint scheduling method
CN112348258A (en) * 2020-11-09 2021-02-09 合肥工业大学 Shared bicycle predictive scheduling method based on deep Q network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Deep Learning Model for Traffic Flow State Classification Based on Smart Phone Sensor Data; Tu Wenwen et al.; arXiv preprint arXiv:1709.08802 *
Deep Reinforcement Learning with Double Q-Learning; van Hasselt et al.; 30th AAAI Conference on Artificial Intelligence; 2094-2100 *
A Survey on Multi-Agent Reinforcement Learning Methods for Vehicular Networks; Ibrahim Althamary et al.; 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC); 2019, 1154-1159 *
Mean field theory and closed queueing network study of bike-sharing systems (共享单车系统的平均场理论与闭排队网络研究); Fan Ruina; 《中国博士学位论文全文数据库 经济与管理科学辑》, No. 1; J151-1 *
Research on shared bicycle dispatching route optimization (共享单车调度路径优化研究); Chen Jiahui et al.; 《交通科技与经济》, Vol. 23, No. 2; 13-20 *


Similar Documents

Publication Publication Date Title
CN113326993B (en) Shared bicycle scheduling method based on deep reinforcement learning
CN111862579B (en) Taxi scheduling method and system based on deep reinforcement learning
Liu et al. Distributed and energy-efficient mobile crowdsensing with charging stations by deep reinforcement learning
CN107831685B (en) Group robot control method and system
CN110659796A (en) Data acquisition method in rechargeable group vehicle intelligence
CN108492568A (en) A kind of Short-time Traffic Flow Forecasting Methods based on space-time characterisation analysis
US20220041076A1 (en) Systems and methods for adaptive optimization for electric vehicle fleet charging
CN112417753A (en) Urban public transport resource joint scheduling method
CN117541026B (en) Intelligent logistics transport vehicle dispatching method and system
CN111352713B (en) Automatic driving reasoning task workflow scheduling method oriented to time delay optimization
Shi et al. Deep q-network based route scheduling for transportation network company vehicles
CN113592162A (en) Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN117350424A (en) Economic dispatching and electric vehicle charging strategy combined optimization method in energy internet
CN116227773A (en) Distribution path optimization method based on ant colony algorithm
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
CN116614394A (en) Service function chain placement method based on multi-target deep reinforcement learning
CN113673836B (en) Reinforced learning-based shared bus line-attaching scheduling method
CN117539929A (en) Lamp post multi-source heterogeneous data storage device and method based on cloud network edge cooperation
CN112750298A (en) Truck formation dynamic resource allocation method based on SMDP and DRL
CN117068393A (en) Star group collaborative task planning method based on mixed expert experience playback
CN117420824A (en) Path planning method based on intelligent ant colony algorithm with learning capability
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant