CN113326993B - Shared bicycle scheduling method based on deep reinforcement learning
Info
- Publication number
- CN113326993B CN113326993B CN202110744265.2A CN202110744265A CN113326993B CN 113326993 B CN113326993 B CN 113326993B CN 202110744265 A CN202110744265 A CN 202110744265A CN 113326993 B CN113326993 B CN 113326993B
- Authority
- CN
- China
- Prior art keywords
- variable
- scheduling
- shared bicycle
- eta
- dispatching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/15—Vehicle, aircraft or watercraft design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06315—Needs-based resource requirements planning or analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/08—Probabilistic or stochastic CAD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/12—Timing analysis or timing optimisation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a shared bicycle scheduling method based on deep reinforcement learning, which comprises the following steps: S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles; S2: determining the scheduling variables of the shared bicycles; S3: constructing a vehicle scheduling optimization model of the shared bicycles; S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the framework. The reinforcement-learning-based shared bicycle scheduling optimization method helps to intelligently solve the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under stochastic and complex dynamic environments. The method considers the supply and demand changes of the environment and the interaction between scheduling decisions and the environment at future times; it requires neither demand prediction in advance nor manual data processing, and is not affected by the computational efficiency and accuracy of demand prediction.
Description
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In the prior art, the bicycle scheduling optimization problem is generally solved by dividing the scheduling time into different periods and searching for the optimal scheduling strategy independently within each period. However, the scheduling policy of one period affects the supply and demand environment of the next and future periods. The period-based isolated policy optimization method considers neither the supply and demand conditions of future periods nor the influence of the implemented policy on them. Under this method, the optimal strategy for the current period does not necessarily promote a higher actual trip amount at future times, and may even cause the future actual trip amount to be lower. Therefore, the period-based isolated policy optimization method does not necessarily yield the globally optimal strategy over the full scheduling time.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem over a long scheduling process, in a dynamic environment, and on a large-scale network, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows: the shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to their operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to their scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the framework.
Further, in step S1, the specific method for dividing the scheduling area of the shared bicycles is as follows: divide the scheduling area into a plurality of equilateral hexagons serving as scheduling area units, and define for each scheduling area unit a global tag variable η_5, a horizontal-direction tag variable m, and a vertical-direction tag variable h, where the global tag is determined by m and h;
where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² − 1}, M denotes the maximum value of the horizontal-direction or vertical-direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units;
In step S1, the operating environment variables of the shared bicycles comprise time variables and a city fixed warehouse location set variable;
the time variables comprise a time step variable t, a time step set T, and a maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max};
the city fixed warehouse location set variable comprises a fixed warehouse location set η_w.
Further, in step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class, and a scheduling policy variable class;
the policy execution state variable class includes a policy execution state variable tr, where tr ∈ {0, 1};
at time step t, the supply and demand environment variable class includes the shared-bicycle travel demand variable of each dispatch area unit, the shared-bicycle supply variable of the dispatch area unit when the policy execution state variable tr = 0, and the shared-bicycle supply variable of the dispatch area unit when the policy execution state variable tr = 1;
at time step t, the riding trip variable class includes the global tag η_2 of the scheduling area unit containing the OD origin of a shared-bicycle trip, the global tag η_3 of the scheduling area unit containing the OD destination, the OD tag variable (η_2, η_3) of the trip, the OD flow of shared-bicycle travel, the ratio of trip flow departing from η_2 and arriving at η_3, the actual trip generation variable of unit η_5, and the actual trip attraction variable of unit η_5;
at time step t, the scheduling policy variable class includes the scheduling vehicle tag set I, the scheduling vehicle tag variable i, the tag variable η_{i,0} of the unit a scheduling vehicle departs from, the tag variable η_{i,1} of the unit it arrives at, the movement direction variable set κ_1 of the scheduling vehicles, the scheduling ratio variable set κ_2, the movement direction variable of the scheduling vehicle from η_{i,0} toward its six adjacent regular hexagons, the scheduling ratio variable of the scheduling vehicle, the scheduling policy of the scheduling vehicle, the maximum cabin capacity of the scheduling vehicle, the variable for the number of shared bicycles picked up at η_{i,0} and placed at η_{i,1}, the ratio α_wh of the number of shared bicycles placed into η_{i,1} to the number of bicycles in the cabin when the scheduling vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, the estimated cumulative increase/decrease variable of the supply of unit η_5 under the expectation that the preceding scheduling vehicles have already applied their scheduling policies, the benefit added after the scheduling vehicle implements its scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
where I = {0, 1, ..., N}, N denotes the maximum value of the scheduling vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}.
Further, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycles;
S42: determining the mean action by means of one-hot encoding;
S43: defining the experience pool variables and training round variables of the shared bicycle scheduling framework;
S44: based on mean field theory, constructing the shared bicycle scheduling framework from its elements, the mean action, the experience pool variables, and the training round variables.
Further, in step S41, the elements of the shared bicycle scheduling framework include the state of each scheduling vehicle at time step t, the joint action a_t, and a reward function, where the joint action collects the scheduling policies of the scheduling vehicles at time step t;
the reward function comprises a reward function based on the trip amount actually added by an individual scheduling vehicle, a reward function based on the average trip amount added per scheduling vehicle, and a reward function based on the globally added trip amount;
where α_rw denotes the scaling coefficient of the reward function, N denotes the maximum value of the scheduling vehicle tag variable, η_5 denotes the global tag of each scheduling area unit, and M′ denotes the unit tag set of the scheduling area units; each reward compares the actual trip amounts of the relevant units when the scheduling policy is implemented with those when no scheduling policy is implemented, counting the number of scheduling vehicles acting within a unit where appropriate.
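As an illustration of the three reward variants (the printed formulas are not reproduced here), the following Python sketch shows their general structure: the trip amount with the scheduling policy minus the trip amount without it, scaled by α_rw. The function names, data layout, and the per-vehicle averaging are assumptions, not the patent's literal definitions.

```python
# Illustrative sketch of the three reward variants; trips_with / trips_without
# map unit tags to actual trip amounts at time step t with and without the
# scheduling policy (assumed inputs).

def reward_individual(unit, trips_with, trips_without, alpha_rw):
    """Reward from the trip amount actually added in the unit served by one vehicle."""
    return alpha_rw * (trips_with[unit] - trips_without[unit])

def reward_average(units, trips_with, trips_without, alpha_rw, vehicles_in_unit):
    """Added trip amount shared among the scheduling vehicles acting on the same units."""
    total = sum((trips_with[u] - trips_without[u]) / max(vehicles_in_unit[u], 1)
                for u in units)
    return alpha_rw * total

def reward_global(all_units, trips_with, trips_without, alpha_rw):
    """Globally added trip amount over every scheduling area unit."""
    return alpha_rw * sum(trips_with[u] - trips_without[u] for u in all_units)

print(reward_individual(0, {0: 12}, {0: 9}, alpha_rw=0.1))  # -> 0.3 (approximately)
```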
Further, in step S42, the specific method for determining the mean action is as follows: rewrite the scheduling policy of each scheduling vehicle as a one-hot encoded vector whose components are 0 or 1, and take the mean of the one-hot actions of the other scheduling vehicles as the mean action;
where ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the scheduling vehicle tag variable, and i_ne denotes the tag variable of a scheduling vehicle different from scheduling vehicle i, whose action policy enters the mean.
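A minimal sketch of the one-hot mean-action computation described above, assuming each scheduling policy is a discrete index into a policy space of dimension ρ_dim (here 24 = 6 movement directions × 4 scheduling ratios); names are illustrative.

```python
import numpy as np

def one_hot(action_index: int, rho_dim: int) -> np.ndarray:
    """Rewrite a discrete scheduling policy as a 0/1 one-hot vector."""
    vec = np.zeros(rho_dim)
    vec[action_index] = 1.0
    return vec

def mean_action(other_action_indices, rho_dim: int) -> np.ndarray:
    """Mean of the one-hot actions of the scheduling vehicles other than vehicle i."""
    encoded = [one_hot(a, rho_dim) for a in other_action_indices]
    return np.mean(encoded, axis=0)

# Example: three other vehicles, two of which chose the same action
print(mean_action([5, 5, 17], rho_dim=24))
```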
Further, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity;
the training round variables comprise the number of training rounds Episode, the target network update interval Episode_upnet, the weight coefficient ω for target network updates, and the cumulative return discount factor γ.
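The experience pool behaves like a standard fixed-capacity replay buffer; one plausible implementation (names assumed) is:

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool; the oldest transitions drop out when full."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, mean_act, reward, next_state):
        """Store one transition of the scheduling vehicle."""
        self.buffer.append((state, action, mean_act, reward, next_state))

    def sample(self, batch_size: int):
        """Randomly sample a batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```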
Further, step S44 includes the following sub-steps (a sketch of the resulting loop follows this list):
S441: initializing the experience pool, setting the experience pool capacity, the weight coefficient ω for target network updates, the reward function scaling coefficient α_rw, and the cumulative return discount factor γ, giving the initial supply, the shared-bicycle travel demand variables, and the ratio of trip flow departing from η_2 and arriving at η_3, and executing steps S442–S445 in a loop according to the number of training rounds;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state and mean action of each scheduling vehicle, and updating the benefit added after each scheduling vehicle implements its scheduling policy according to the reward function;
S446: based on the update process of steps S442–S445, constructing the shared bicycle scheduling framework using a reinforcement learning algorithm, and completing shared bicycle scheduling with the framework.
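Steps S441–S446 amount to the training skeleton below. The stub environment, stub agents, and the constant values (γ = 0.95, ω = 0.01, etc.) are placeholders so the loop structure runs end to end; they are assumptions, not the patent's implementation.

```python
import random
from collections import deque

class StubEnv:
    """Stand-in for the supply-demand environment defined in steps S1-S3."""
    def update_rider_trips(self, t): pass             # S442: tr = 0, riders rent/park
    def state(self, i): return (0, 0)                 # (unit supply, unit tag)
    def apply_scheduling(self, actions): pass         # S444: tr = 1, policies applied
    def transition(self, i): return ((0, 0), 0, 0.0, 0.0, (0, 0))

class StubAgent:
    """Stand-in for a scheduling vehicle's policy and value models."""
    def act(self, state): return random.randrange(24)      # 6 directions x 4 ratios
    def learn(self, batch, gamma): pass                    # loss + gradient step
    def update_target_networks(self, omega): pass          # omega-weighted transfer

pool, env = deque(maxlen=10_000), StubEnv()
agents = [StubAgent() for _ in range(5)]
T_MAX, BATCH, GAMMA, OMEGA, EP_UPNET = 143, 32, 0.95, 0.01, 10

for episode in range(3):                        # S441: loop over training rounds
    for t in range(T_MAX + 1):
        env.update_rider_trips(t)               # S442
        actions = [ag.act(env.state(i)) for i, ag in enumerate(agents)]
        env.apply_scheduling(actions)           # S444
        for i, ag in enumerate(agents):         # S445: next states, mean actions,
            pool.append(env.transition(i))      #        rewards stored as transitions
    if len(pool) >= BATCH:                      # S446: reinforcement learning update
        batch = random.sample(list(pool), BATCH)
        for ag in agents:
            ag.learn(batch, GAMMA)
    if episode % EP_UPNET == 0:                 # target networks every EP_UPNET rounds
        for ag in agents:
            ag.update_target_networks(OMEGA)
```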
Further, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: update and calculate the shared-bicycle path flow of each OD tag variable (η_2, η_3), the actual trip generation variable of unit η_5, the actual trip attraction variable of unit η_5, and the supply when the policy execution state variable tr = 0;
in step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: update and calculate the joint action a_t, the variable for the number of shared bicycles each scheduling vehicle picks up at η_{i,0} and places at η_{i,1}, and the supply when the policy execution state variable tr = 1;
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-learning method;
each scheduling vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each scheduling vehicle comprises a policy estimation network and a policy target network; the policy estimation network is a neural network whose input is the state of the scheduling vehicle and whose output is its scheduling policy, and the policy target network is a neural network whose input is the state of the scheduling vehicle at the next time step and whose output is the scheduling policy of the next time step;
in both the policy gradient method and the Q-learning method, the value model of each scheduling vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the scheduling vehicle, its scheduling policy, and the mean action, and whose output is the Q-value function Q_i, where the Q function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward attained by the scheduling vehicle; the value target network is a neural network whose inputs are the state of the scheduling vehicle at the next time step, the scheduling policy of the next time step, and the mean action of the next time step, and whose output is the target Q-value function;
in the Q-learning method, the policy model of each scheduling vehicle samples a provisional action by probability from a distribution computed from the Q-value function Q_i and the mean action of time step t−1, where ω_d denotes the policy parameter, i_ne denotes the tag variable of a scheduling vehicle different from scheduling vehicle i, and A_i denotes the action space set of scheduling vehicle i; the mean action is then updated from the sampled actions, substituted back into the action probability calculation, an action is re-sampled by probability accordingly, and the re-sampled action is taken as the final choice of the policy model of the scheduling vehicle;
if the reinforcement learning algorithm adopts the policy gradient method, each transition is stored into the experience pool, and a batch of samples is randomly drawn from the experience pool; according to the sampled transitions, the cumulative return discount factor γ, and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, the parameters θ_i of the policy model neural network and the parameters of the value model neural network are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and value target network, respectively; here the sampled quantities denote the global states, policies, mean actions, and reward values of the sampled transitions, and s_{t+1} denotes the global state of the next time step;
if the reinforcement learning algorithm adopts the Q-learning method, each transition is likewise stored into the experience pool, and a batch of samples is randomly drawn from it; over the sampled transitions, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, the neural network parameters of the value model are transferred to the value target network according to the target network update weight coefficient ω.
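Two recurring mechanics of step S446 — sampling an action by probability from the Q-values, and the ω-weighted target-network update — can be sketched as follows. The softmax (Boltzmann) form and the soft-update rule θ_target ← ω·θ + (1−ω)·θ_target are standard mean-field Q-learning choices and are assumptions here, since the patent's printed formulas are not reproduced.

```python
import numpy as np

def boltzmann_sample(q_values: np.ndarray, omega_d: float) -> int:
    """Sample an action index with probability proportional to exp(omega_d * Q)."""
    logits = omega_d * q_values - np.max(omega_d * q_values)   # numerically stable
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))

def soft_update(estimate_params, target_params, omega: float):
    """Blend estimation-network weights into the target network with weight omega."""
    return [omega * e + (1.0 - omega) * t for e, t in zip(estimate_params, target_params)]

# Re-sampling after the mean action (and hence the Q-values) is updated:
a_provisional = boltzmann_sample(np.array([0.2, 0.8, 0.1, 0.5]), omega_d=2.0)
a_final = boltzmann_sample(np.array([0.3, 0.7, 0.2, 0.6]), omega_d=2.0)
print(a_provisional, a_final, soft_update([1.0], [0.0], omega=0.01))
```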
The beneficial effects of the invention are as follows:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method helps to intelligently solve the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under stochastic and complex dynamic environments. The method requires neither demand prediction in advance nor manual data processing, and is not affected by the computational efficiency and accuracy of demand prediction. Moreover, instead of seeking the optimal strategy for each time period in isolation, it is an overall optimization method for the entire scheduling process that takes into account the supply and demand changes of future time periods and the influence of each scheduling decision on the supply and demand of the next time period.
(2) The dynamic optimized scheduling strategy provided by the invention improves scheduling operation efficiency. It raises the trip amount and utilization rate of shared bicycles and reduces the lost demand of shared bicycle users. It lowers the idle rate of shared bicycles on the roads and reduces the number of idle vehicles excessively accumulated in certain areas, thereby reducing the waste of shared resources and alleviating the deterioration of the urban environment caused by large piles of idle vehicles.
(3) Increasing the actual trip amount of shared bicycle users raises the share of shared bicycles in feeder trips and improves the operating efficiency of the public transportation system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Drawings
FIG. 1 is a flow chart of a shared bicycle scheduling method;
FIG. 2 is a diagram of scheduling area units based on equilateral hexagonal division.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Before describing particular embodiments of the present invention, in order to make the aspects of the present invention more apparent and complete, abbreviations and key term definitions appearing in the present invention will be described first:
OD traffic: and the traffic volume between the starting and ending points is indicated. "O" is derived from the English ORIGIN and refers to the departure place of the trip, and "D" is derived from the English DESTINATION and refers to the DESTINATION of the trip.
MFMARL algorithm: mean Field Multi-Agent Reinforcement Learning, multi-agent reinforcement learning algorithm based on the average Field game theory.
As shown in fig. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to their operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to their scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the framework.
In the embodiment of the invention, the dynamic scheduling optimization problem of shared bicycles is treated as a sequential decision problem that considers the interactive influence between the supply and demand environment and the implemented scheduling strategy. According to the duration of the scheduling optimization cycle and whether surplus shared bicycles may be placed into a city fixed warehouse, the scheduling optimization problem can be divided into two: the shared bicycle scheduling optimization problem without a fixed warehouse, and the problem that considers a fixed warehouse.
In the shared bicycle scheduling optimization problem, the optimization target is neither the maximum actual trip amount within a single time period nor the scheduling efficiency of a single scheduling vehicle; rather, the global trip amount is maximized through cooperative dynamic scheduling strategy optimization over the whole scheduling cycle. Further, on top of this objective, the scheduling policy considered here includes the act of placing excess vehicles into a warehouse, when a city warehouse exists, so as to reduce redundant bicycles on the roads.
The invention constructs a scheduling optimization process for shared bicycles, as shown in fig. 3. In the dynamic scheduling optimization process, the invention considers the renting, riding, and parking of bicycles, the scheduling process, and the changes in supply and demand. At each time step, each scheduling vehicle picks up a certain number of shared bicycles from the unit where it is currently located and loads them into its cabin; the scheduling vehicle then travels to its arrival unit and places all the shared bicycles in its cabin at that unit.
In the embodiment of the present invention, as shown in fig. 2, the specific method for dividing the scheduling area is as follows: divide the scheduling area of the shared bicycles into a plurality of equilateral hexagons serving as scheduling area units, and define for each scheduling area unit a global tag variable η_5, a horizontal-direction tag variable m, and a vertical-direction tag variable h, where the global tag is determined by m and h;
where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² − 1}, M denotes the maximum value of the horizontal-direction or vertical-direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units;
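Since the printed tag relation is not reproduced above, the sketch below assumes a row-major mapping η_5 = h·(M+1) + m over the (M+1)×(M+1) grid of hexagon units; the exact formula of the original filing may differ.

```python
def global_tag(m: int, h: int, M: int) -> int:
    """Assumed row-major global tag of the unit with horizontal tag m, vertical tag h."""
    assert 0 <= m <= M and 0 <= h <= M
    return h * (M + 1) + m                 # eta_5 in M' = {0, ..., (M+1)**2 - 1}

def horizontal_vertical(eta5: int, M: int) -> tuple[int, int]:
    """Inverse conversion from a global tag back to (m, h)."""
    return eta5 % (M + 1), eta5 // (M + 1)

M = 9                                      # a 10 x 10 grid of hexagon units
print(global_tag(3, 2, M))                 # -> 23
print(horizontal_vertical(23, M))          # -> (3, 2)
```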
in step S1, the operating environment variables of the shared bicycles comprise time variables, a city fixed warehouse location set variable, and supply parameters;
the time variables comprise a time step variable t, a time step set T, and a maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max};
the city fixed warehouse location set variable comprises the fixed warehouse location set η_w;
Some units in a city may be designated as city fixed warehouses; when scheduling measures are implemented, a scheduling vehicle may deposit idle shared bicycles into the city fixed warehouse of such a unit. The capacity of a city fixed warehouse has no upper limit, and bicycles deposited in the warehouse by scheduling vehicles are no longer taken out or given to riders for use. When the unit containing the destination of a rider's trip is a city fixed warehouse unit, the shared bicycle parked by the rider is not placed into the warehouse but remains in the unit and can still be used by riders at future times. The city fixed warehouse location set variable comprises the unit locations of the city fixed warehouses within the whole area.
The supply parameters include a first supply coefficient c_dis and a second supply coefficient c_initial. The first supply coefficient c_dis is determined as follows: calculate the demand value of each scheduling area unit at each time step from the shared-bicycle demand data, and take the 40th percentile of the 10-minute demand values over all scheduling area units as c_dis. The second supply coefficient c_initial is determined as the ratio of the shared-bicycle supply within each dispatch area unit at the initial time to the first supply coefficient c_dis.
It is assumed that the shared-bicycle supply of each unit is uniformly distributed over the area at the initial time. To keep the study of the influence of supply on travel generalizable, the invention does not directly fix the supply number but determines the supply value from the relation between supply and demand. The invention defines the first supply coefficient c_dis as the 40th percentile, over all units and all 10-minute time steps, of the unit demand values calculated from the demand data. The 40th percentile is chosen instead of the mean because the mean is more susceptible to extreme values; it is the value at the 40% position after all values are sorted in ascending order. This avoids the few very high demands in the riding demand sequence of all units at each time step making the analysis results non-generalizable.
The invention defines the second supply parameter c_initial as the ratio of each unit's shared-bicycle supply at the initial time to the first supply coefficient c_dis. In the invention, c_initial is selected from six values, c_initial ∈ {20, 50, 100, 200, 500, 1000}. The shared-bicycle supply of each unit at the initial time is thus defined as the product of c_dis and c_initial, rounded down to an integer.
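A small sketch of how c_dis and the initial unit supply could be computed from 10-minute demand values, following the definitions above; the data layout and the synthetic demand are assumptions.

```python
import numpy as np

# demand[t, u]: 10-minute demand of unit u at time step t (synthetic example data)
rng = np.random.default_rng(0)
demand = rng.poisson(lam=4.0, size=(144, 100)).astype(float)

# First supply coefficient: 40th percentile of all 10-minute unit demand values,
# preferred over the mean because the mean is more sensitive to extreme values.
c_dis = np.percentile(demand, 40)

# Second supply coefficient: ratio of the initial unit supply to c_dis.
c_initial = 100
initial_supply_per_unit = int(c_dis * c_initial)   # rounded down to an integer
print(c_dis, initial_supply_per_unit)
```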
In the embodiment of the invention, in step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding travel variable class and a scheduling policy variable class;
the policy execution state variable class includes a policy execution state variable tr, where tr ∈ {0, 1};
at time step t, the supply and demand environment variable class includes the shared-bicycle travel demand variable of each dispatch area unit, the shared-bicycle supply variable of the dispatch area unit when the policy execution state variable tr = 0 (representing the number of shared bicycles available for use), and the shared-bicycle supply variable of the dispatch area unit when the policy execution state variable tr = 1 (likewise representing the number of shared bicycles available for use);
at time step t, the riding trip variable class includes the global tag η_2 of the scheduling area unit containing the OD origin of a shared-bicycle trip, the global tag η_3 of the scheduling area unit containing the OD destination, the OD tag variable (η_2, η_3) of the trip, the OD flow of shared-bicycle travel, the ratio of trip flow departing from η_2 and arriving at η_3, the actual trip generation variable of unit η_5, and the actual trip attraction variable of unit η_5;
η_2 and η_3 convert between horizontal-direction and vertical-direction tags in the same way as the tag relation of the scheduling area units. For a given origin unit η_2, the trip flow ratios over all destination units η_3 sum to 1. When η_2 = η_5, the sum of the OD flows with origin unit η_2 equals the actual trip generation of unit η_5; when η_3 = η_5, the sum of the OD flows with destination unit η_3 equals the actual trip attraction of unit η_5;
at time step t, the scheduling policy variable class includes the scheduling vehicle tag set I, the scheduling vehicle tag variable i, the tag variable η_{i,0} of the unit a scheduling vehicle departs from, the tag variable η_{i,1} of the unit it arrives at, the movement direction variable set κ_1 of the scheduling vehicles, the scheduling ratio variable set κ_2, the movement direction variable of the scheduling vehicle from η_{i,0} toward its six adjacent regular hexagons, the scheduling ratio variable of the scheduling vehicle, the scheduling policy of the scheduling vehicle, the maximum cabin capacity of the scheduling vehicle, the variable for the number of shared bicycles picked up at η_{i,0} and placed at η_{i,1}, the ratio α_wh of the number of shared bicycles placed into η_{i,1} to the number of bicycles in the cabin when the scheduling vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, the estimated cumulative increase/decrease variable of the supply of unit η_5 under the expectation that the preceding scheduling vehicles have already applied their scheduling policies, the benefit added after the scheduling vehicle implements its scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
where I = {0, 1, ..., N}, N denotes the maximum value of the scheduling vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}; when the movement direction variable takes the values 0 to 5, the scheduling vehicle moves to the adjacent unit at the lower left, right, upper left, left, lower right, and upper right, respectively, and the relation is as follows (an illustrative sketch follows below):
where m denotes the horizontal-direction tag variable of a scheduling area unit, h denotes the vertical-direction tag variable of the scheduling area unit, M′ denotes the unit tag set of the scheduling area units, and T denotes the time step set.
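Because the printed neighbour relation is not reproduced, the offset table below is an assumed row-parity hexagon layout that matches the stated direction order (lower left, right, upper left, left, lower right, upper right); a vehicle that would leave the grid stays in place, per assumption two discussed later.

```python
# Assumed (dm, dh) offsets for directions 0..5, depending on the parity of h;
# this particular hexagon layout is illustrative, not the patent's printed relation.
NEIGHBOR_OFFSETS = {
    0: [(-1, -1), (0, -1)],   # lower left  (even h, odd h)
    1: [(1, 0), (1, 0)],      # right
    2: [(-1, 1), (0, 1)],     # upper left
    3: [(-1, 0), (-1, 0)],    # left
    4: [(0, -1), (1, -1)],    # lower right
    5: [(0, 1), (1, 1)],      # upper right
}

def move(m: int, h: int, direction: int, M: int) -> tuple[int, int]:
    """Next unit of a scheduling vehicle; it stays put if it would leave the region."""
    dm, dh = NEIGHBOR_OFFSETS[direction][h % 2]
    nm, nh = m + dm, h + dh
    if 0 <= nm <= M and 0 <= nh <= M:
        return nm, nh
    return m, h                        # boundary rule: remain at the current location

print(move(0, 0, 3, M=9))              # -> (0, 0): moving left off the grid keeps it
```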
η_{i,0} and η_{i,1} convert between horizontal-direction and vertical-direction tags in the same way as the tag relation of the scheduling area units. The scheduling ratio variable of scheduling vehicle i takes one of four percentages and represents the number of shared bicycles picked up by scheduling vehicle i at its current unit as a percentage of that unit's current supply. When the cumulative number of shared bicycles expected to be moved out of unit η_5 exceeds the number expected to be placed into it, the estimated cumulative increase is negative; conversely, when the cumulative number expected to be moved out is smaller than or equal to the number placed in, the estimated cumulative increase is non-negative.
In the embodiment of the present invention, in step S3, the vehicle scheduling optimization model of the shared bicycles is specified as follows:
the benefit added after the scheduling vehicles implement their scheduling policies is maximized as the objective function of the short-term scheduling optimization problem of the shared bicycles, i.e. the objective is max Σ_{t=0}^{T_max} Σ_{i=0}^{N} (benefit added by scheduling vehicle i at time step t), subject to the constraints below, where t denotes a time step, T_max denotes the maximum time step variable, i denotes the scheduling vehicle tag variable, and N denotes the maximum value of the scheduling vehicle tag variable;
The invention defines the benefit as the trip amount of shared bicycles added relative to the situation without any scheduling policy. The decision variable is the action decision of each scheduling vehicle, comprising the movement direction of the scheduling vehicle and its scheduling ratio.
When the policy execution state variable at time step t is tr = 0, the decision variable is the action decision of scheduling vehicle i, which consists of the movement direction variable of the scheduling vehicle from η_{i,0} toward its six adjacent regular hexagons and the scheduling ratio variable of the scheduling vehicle;
when the policy execution state variable at time step t is tr = 0, and the global tag variable η_5 of a scheduling area unit coincides with the global tag η_2 of the unit containing the OD origin of shared-bicycle travel, the shared-bicycle path flow of the OD tag variable (η_2, η_3) is calculated as the rounded-down product of the unit's actual trip amount and the trip flow ratio departing from η_2 and arriving at η_3, where INT(·) denotes rounding down to an integer and M′ denotes the unit tag set of the scheduling area units;
the actual trip amount of the scheduling area unit when the policy execution state is tr = 0 is the smaller of the unit's supply and its generated demand, so the path flow does not exceed the rounded-down product of the actual trip amount of unit η_5 and the trip flow ratio with origin η_2 and destination η_3.
The trip flow ratio of shared bicycles departing from η_2 and arriving at η_3 satisfies the conservation relation between path flows and OD flows: for each origin unit η_2 (the global tag of the scheduling area unit containing the OD origin), the trip flow ratios over all destination units η_3 sum to 1, where T denotes the time step set and η_3 denotes the global tag of the unit containing the OD destination of the shared-bicycle trip;
according to the path flows, when the policy execution state at time step t is tr = 0 and the global tag variable η_5 of a scheduling area unit coincides with the global tag η_2 of the unit containing the OD origin of shared-bicycle travel, the sum of the shared-bicycle path flows with origin η_2 is taken as the actual trip amount of unit η_5;
when the policy execution state variable at time step t is tr = 0 and the global tag variable η_5 of a scheduling area unit coincides with the global tag η_3 of the unit containing the OD destination, the sum of the shared-bicycle path flows of the OD tag variables (η_2, η_3) with destination η_3 is taken as the actual trip attraction of unit η_5;
when the policy execution state variable at time step t is tr = 0, the shared-bicycle supply is updated according to the numbers of shared bicycles rented and parked during the riders' trip activity: the supply of unit η_5 equals the supply after the scheduling policy was applied at time step (t−1) (policy execution state tr = 1), minus the actual trip amount of unit η_5 at time step t, plus the actual trip attraction of unit η_5 at time step t;
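The tr = 0 supply update is a per-unit conservation rule; a compact sketch under the naming assumptions used here:

```python
def update_supply_tr0(supply_after_policy, trips_out, trips_in):
    """Supply at tr = 0: previous post-policy supply minus actual trip
    generation plus actual trip attraction, for every unit eta_5."""
    return {u: supply_after_policy[u] - trips_out[u] + trips_in[u]
            for u in supply_after_policy}

# Example for three units
print(update_supply_tr0({0: 10, 1: 5, 2: 8}, {0: 3, 1: 0, 2: 2}, {0: 1, 1: 4, 2: 0}))
# -> {0: 8, 1: 9, 2: 6}
```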
when the policy execution state variable at time step t is tr = 0, the unit tag variable that the scheduling vehicle will reach at time step (t+1) is calculated from the horizontal-direction tag variable m and the vertical-direction tag variable h of its current unit together with the movement direction variable of the scheduling vehicle from η_{i,0} toward its six adjacent regular hexagons; the unit reached becomes the departure unit tag variable of the scheduling vehicle at time step (t+1);
when the policy execution state variable at time step t is tr = 0, the estimated cumulative increase/decrease of the supply of unit η_5, under the expectation that the first (i−1) scheduling vehicles have already applied their scheduling policies, is calculated as follows, where α_wh denotes the ratio of the number of shared bicycles placed into η_{i,1} to the number of bicycles in the cabin when the scheduling vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, and η_w denotes the fixed warehouse location set:
if the planned scheduling policies of the preceding scheduling vehicles do not involve unit η_5, the estimated cumulative increase of unit η_5 is 0. If the (i−1)-th scheduling vehicle plans to pick up a number of bicycles from unit η_5, the estimated cumulative increase of the supply of η_5 decreases by that number. If the (i−1)-th scheduling vehicle plans to place a number of bicycles into unit η_5 and the tag value of η_5 does not belong to the city fixed warehouse location set η_w, the estimated cumulative increase of the supply of η_5 increases by that number; if the tag value of η_5 belongs to η_w, the estimated cumulative increase increases by the placed number adjusted by the ratio α_wh. When the city fixed warehouse location set η_w is empty, the case of the city fixed warehouse is ignored by default.
when the policy execution state variable at time step t is tr = 0, the scheduling vehicle picks up a number of shared bicycles from η_{i,0} into its cabin and places all of them into η_{i,1}. The number of bicycles picked up is calculated as the smaller of (a) the rounded-down product of the scheduling ratio of the scheduling vehicle and the remaining supply of η_{i,0} after the first (i−1) scheduling vehicles have applied their policies in state tr = 0, and (b) the maximum cabin capacity of the scheduling vehicle, where min(·) denotes taking the minimum;
the scheduling ratio expresses the pick-up as a percentage of the current unit's supply, so the number of bicycles picked up reflects the scheduling policy, the remaining supply, and the cabin capacity, and the whole formula is constrained to be a non-negative integer.
when the policy execution state variable at time step t is tr = 1, the scheduling policy is executed according to the number of bicycles picked up by each scheduling vehicle, and the supply of unit η_5 after the scheduling policy is implemented is updated as follows:
if the scheduling policy of scheduling vehicle i does not involve unit η_5, the supply of η_5 is unchanged. If scheduling vehicle i picks up a number of bicycles from unit η_5, the supply of η_5 decreases by that number. If scheduling vehicle i places a number of bicycles into unit η_5 and the tag value of η_5 does not belong to the city fixed warehouse location set η_w, the supply of η_5 increases by that number; conversely, when the tag value of η_5 belongs to η_w, only the portion of the placed bicycles determined by α_wh is added to the supply, and the remaining shared bicycles are placed by default into the city fixed warehouse of unit η_5.
The total number Z_warehouse of shared bicycles stored in the city fixed warehouses is obtained by accumulating, over the scheduling cycle, the numbers of bicycles placed into the warehouse units.
The shared bicycle schedule optimization problem makes two assumptions. Assumption one: each scheduling vehicle implements its scheduling policy in turn according to its number. A rider rents a shared bicycle according to the current supply in the current unit; the decision maker formulates a scheduling policy based on the supply and demand environment after trips are completed, then implements the policy and updates the environment. The policy execution state variable tr divides each time step into two states: a state in which rider trips are updated and the scheduling policy is formulated, and a state in which the scheduling policy is implemented. When tr = 0, riders rent, use, and park shared bicycles, the supply and demand changes of each unit at time step t are updated, and a scheduling policy is then generated based on the post-trip supply and demand; when tr = 1, the scheduling policy is implemented and the supply and demand environment under its influence is updated. Assumption two: to ensure that a scheduling vehicle never travels outside the region, the invention assumes that the scheduling vehicle stays at its current location whenever this would occur; that is, when a scheduling vehicle would reach a unit outside the region at time step t+1 according to its scheduling policy, the policy is updated so that the unit reached at time step t+1 is the unit occupied at time step t, and the departure and arrival unit tags must simultaneously remain within the unit tag set.
in the shared bicycle scheduling optimization problem, the length of the scheduling period can be controlled by the setting of the time step set.The invention sets the scheduling period of the shared bicycle short-term scheduling optimization problem as one day, namely T max Time set t= {0,1,..143 }. In the conventional scheduling method, the scheduling period is typically one day. In practice, however, the problems of uneven distribution and loss of demand of the shared bicycle become more serious with the increase of time, due to the limited number of scheduled vehicles. Especially at the later stages of the operation process, it is more challenging to formulate an effective strategy because the shared bicycle distribution is more unbalanced. Aiming at the problem of long-term operation scheduling, the invention also considers the problem of long-period dynamic scheduling optimization of the shared bicycle. The invention defines the dispatching cycle of the shared bicycle long-term dispatching optimization problem to be 7 days, T max Time set t= {0,1,..1007 }.
On the basis of satisfying the goal of increasing the shared-bicycle trip amount as much as possible, the invention further aims to reduce the number of shared bicycles left excessively idle on urban roads. The invention therefore proposes a dynamic scheduling optimization problem for shared bicycles that includes city warehouses. In this problem, it is assumed that fixed warehouses exist in the city and can store excess bicycles. During a scheduling operation, a scheduling vehicle may move to a unit containing a warehouse and place the bicycles stored in its cabin into the city warehouse. When the city fixed warehouse location set η_w is empty, the shared bicycle scheduling optimization problem reduces to the case without city fixed warehouses, i.e. the scheduling vehicles cannot place redundant shared bicycles into a city fixed warehouse during scheduling. Conversely, when η_w is not empty, the problem by default assumes that city fixed warehouses exist and redundant idle shared bicycles can be stored.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0, 1}, the scheduling vehicle tag set I = {0, 1, ..., N}, the movement direction variable set of the scheduling vehicles κ_1 = {0, 1, ..., 5}, the scheduling ratio variable set κ_2 = {0, 0.25, 0.5, 0.75}, and the unit tag set M′ = {0, 1, ..., (M+1)² − 1}. According to the variable definitions of the short-term scheduling optimization problem, under implementation of the scheduling policy, the constraints on each unit's actual trip amount and actual trip attraction guarantee conservation of the shared-bicycle trip flows of each unit.
In the constructed short-term scheduling optimization problem of the shared bicycles, the objective function is to maximize the total trip amount of shared bicycles added in the areas through which the scheduling vehicles pass, compared to the case where no scheduling policy is executed. The decision variables are the action decisions of the scheduling vehicles, including the movement direction of each scheduling vehicle to a unit and the number of bicycles to be scheduled. The constraints are the conservation of the total number of shared bicycles, the conservation relation between riding trip path flows and riding OD flows, and the non-negativity and integrality of the flows in the scheduling process. When the travel demand generated in a unit exceeds the available shared bicycles in that unit, the excess demand is treated as lost demand.
In an embodiment of the present invention, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycles;
S42: determining the mean action by means of one-hot encoding;
S43: defining the experience pool variables and training round variables of the shared bicycle scheduling framework;
S44: based on mean field theory, constructing the shared bicycle scheduling framework from its elements, the mean action, the experience pool variables, and the training round variables.
Based on the proposed scheduling optimization problem, the invention provides a shared bicycle scheduling framework built on multi-agent reinforcement learning with mean field theory. The goal is to let the agents learn the changing riding demand, adapt to a random dynamic environment, realize cooperative dynamic decision optimization, and increase riding trips.
In the embodiment of the invention, in step S41, the shared bicycle transfer process model and the multi-agent reinforcement learning algorithm are combined to construct the vehicle dispatch model of the shared bicycle. The invention defines I as the agent label set, equivalent to the label set of the dispatch trucks; S as the state set; A_i as the action space of agent i; P as the transition probability function; R as the reward function; and γ as the discount factor. The MDP-based reinforcement learning model thus comprises six elements: G = (I, S, A, P, R, γ), where i ∈ I = {0, 1, ..., N} is the dispatch truck label variable and, equivalently, the agent label in the reinforcement learning algorithm.
The elements of the shared bicycle scheduling framework include the state s_t^i, the behavior parameter a_t, and the reward function, where s_t^i denotes the state of dispatch truck i at time step variable t and ρ_t^i denotes the scheduling strategy of the truck at time step variable t;
The reward function takes one of three forms: the actually increased trip reward of a dispatch truck r_t^{i,PA}, the average increased trip reward r_t^{i,APA}, and the globally increased trip reward r_t^{i,APTU}, given by

$$r_t^{i,PA}=\alpha_{rw}\sum_{\eta_5\in\{\eta_{i,0},\eta_{i,1}\}}\left(\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}\right)$$

$$r_t^{i,APA}=\alpha_{rw}\sum_{\eta_5\in\{\eta_{i,0},\eta_{i,1}\}}\frac{\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}}{n_t^{\eta_5}}$$

$$r_t^{i,APTU}=\frac{\alpha_{rw}}{N+1}\sum_{\eta_5\in M'}\left(\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}\right)$$

where α_rw is the scaling coefficient of the reward function; \hat P_t^{η_{i,0}} and \hat P_t^{η_{i,1}} are the actual trip productions of units η_{i,0} and η_{i,1} when the scheduling policy is implemented, and \tilde P_t^{η_{i,0}} and \tilde P_t^{η_{i,1}} the corresponding productions when it is not; n_t^{η_{i,0}} and n_t^{η_{i,1}} are the numbers of dispatch trucks within η_{i,0} and η_{i,1} at time step variable t; \hat P_t^{η_5} and \tilde P_t^{η_5} are the trip productions of unit η_5 with and without the scheduling policy; N is the maximum value of the dispatch truck label variable; η_5 is the global label of each scheduling region unit; and M′ is the unit label set of the scheduling region units.
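As a hedged illustration of how the three rewards differ, the sketch below computes them from per-cell trip productions with and without the policy; the array names (`p_hat`, `p_tilde`) and the exact weighting are assumptions consistent with the definitions above, not the patent's verbatim formulas:

```python
import numpy as np

def pa_reward(alpha_rw, p_hat, p_tilde, cells):
    # PA: increased trips in the cells the truck passes through
    # (origin cell eta_i0 and destination cell eta_i1).
    return alpha_rw * sum(p_hat[c] - p_tilde[c] for c in cells)

def apa_reward(alpha_rw, p_hat, p_tilde, cells, trucks_in_cell):
    # APA: the same quantity averaged over the trucks present in each
    # cell (guarded against division by zero).
    return alpha_rw * sum((p_hat[c] - p_tilde[c]) / max(trucks_in_cell[c], 1)
                          for c in cells)

def aptu_reward(alpha_rw, p_hat, p_tilde, n_agents):
    # APTU: global increase over all cells, shared equally by all agents.
    return alpha_rw * float(np.sum(p_hat - p_tilde)) / n_agents
```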
For the state s_t^i, the invention assumes that the state of dispatch truck i at time step variable t comprises the supply of the cell in which agent i is located and the location label of that cell, i.e., s_t^i = (C_t^{η_{i,0}}, η_{i,0}).
The behavior parameter refers to the joint action of the scheduling policies of the dispatch trucks at time t and satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, where A is the joint space formed from the action space sets A_i. The action strategy a_t^i of agent i equals the scheduling strategy of the dispatch truck, i.e., a_t^i = ρ_t^i.
Agent i refers to each dispatch truck label in the city, and r_t^i is the immediate evaluation, given by the environment during agent i's interaction with it, of the state and the action generated. The goal of agent i is to maximize the cumulative reward value. Based on the shared bicycle scheduling problem, the invention considers three forms of the reward function; the variable used in their calculation is defined as follows:

α_rw - reward function scaling coefficient, dimensionless;
In the shared bicycle scheduling problem, the invention considers three selectable forms of the reward r_t^i, i.e., r_t^i ∈ {r_t^{i,PA}, r_t^{i,APA}, r_t^{i,APTU}}.
(1) Increased trip reward obtained by the agent: the invention defines the Increased Trip Production Obtained by Agent (PA) reward function, denoted r_t^{i,PA}. It represents the increase in shared bicycle trips obtained by each agent after that agent performs an action. In the PA reward function, all rewards within the cells an agent moves through are attributed to that agent. This setting may cause agents to concentrate on scheduling certain units.
(2) Average increased trip reward obtained by the agent: the invention defines the Average Increased Trip Production Obtained by Agent (APA) reward function, denoted r_t^{i,APA}. It refers to the average increase in shared bicycle trips each agent obtains after performing an action, defined as the average increased trip production of the units η_{i,0} and η_{i,1} traversed by the dispatch truck after executing scheduling policy ρ_t^i.
(3) Globally increased trips obtained by the agent: the invention defines the Average Increased Trip Production of Total Units (APTU) reward function, denoted r_t^{i,APTU}. It refers to the total increase in shared bicycle trips over all units obtained after all agents perform the joint action.
The state transition probability describes how, as the time step advances, the state of each agent is updated according to the joint action performed by the agents and their interaction with the environment.
In the embodiment of the invention, in step S42, the average action is determined as follows: the scheduling policy ρ_t^i of a dispatch truck is rewritten in one-hot encoding, and the average action \bar a_t^i is computed as

$$\bar a_t^i=\frac{1}{N}\sum_{i_{ne}\neq i} a_t^{i_{ne}}$$

where each component of an action vector is a variable taking the value 0 or 1, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatch truck label variable, a_t^{i_ne} denotes the action policy of agent i_ne, and i_ne is the label variable of a dispatch truck different from truck i.

The invention rewrites the movement direction d_t^i and the scheduling ratio c_t^i as one-hot action vectors; the average action of agent i is then the mean over the remaining agents' actions.
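A minimal sketch of the one-hot encoding and the mean-field action, assuming the action spaces |κ_1| = 6 and |κ_2| = 4 given above (the helper names are illustrative):

```python
import numpy as np

N_DIRECTIONS, N_RATIOS = 6, 4          # |kappa_1| = 6, |kappa_2| = 4
RHO_DIM = N_DIRECTIONS + N_RATIOS      # one-hot policy dimension rho_dim

def one_hot_action(direction: int, ratio_idx: int) -> np.ndarray:
    """Encode a (direction, ratio) scheduling policy as a one-hot vector."""
    v = np.zeros(RHO_DIM)
    v[direction] = 1.0
    v[N_DIRECTIONS + ratio_idx] = 1.0
    return v

def mean_action(actions, i):
    """Mean-field action for agent i: average of the other agents'
    one-hot action vectors."""
    others = [a for j, a in enumerate(actions) if j != i]
    return np.mean(others, axis=0)
```

With this representation, the centralized value input carries only the agent's own one-hot action plus one averaged vector (2·ρ_dim numbers) instead of the full joint action of (N+1)·ρ_dim numbers.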
In a traditional multi-agent deep reinforcement learning algorithm, the joint action satisfies a_t ∈ A = A_0 × A_1 × ... × A_N and has dimension (N+1)·ρ_dim, which expands as the number of agents grows, increasing the network complexity of reinforcement learning, lowering computational efficiency, and weakening the policy optimization effect.

In the MFMARL algorithm, by contrast, the joint average action \bar a_t^i has dimension ρ_dim, and the central input (own action plus average action) has dimension 2·ρ_dim. Processing the joint action in the mean-field (MF) way therefore keeps its dimension under control and preserves computational efficiency. In particular, in complex simulation settings the number of agents is usually large, and joint average actions based on MF theory alleviate the dimension explosion of the joint action caused by the growing number of agents.
In an embodiment of the invention, in step S43, the experience pool variables of the shared bicycle scheduling framework comprise the experience pool D and the experience pool capacity |D|.

The training round variables comprise the number of training rounds Episode, the target network update interval Episode_upnet, the weight coefficient ω of the target network update, and the cumulative return discount factor γ.
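A simple sketch of such an experience pool, assuming transitions of the form (state, action, mean action, reward, next state); the class name is illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool D with a fixed capacity |D|; the oldest
    transitions are evicted first once the pool is full."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, a_bar, r, s_next):
        self.buffer.append((s, a, a_bar, r, s_next))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```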
In an embodiment of the present invention, step S44 includes the sub-steps of:
S441: initializing the experience pool D; setting the experience pool capacity |D|, the target network update weight ω, the reward function scaling coefficient α_rw, and the cumulative return discount factor γ; giving the initial supply at t = 0, the shared bicycle travel demand variable, and the trip flow ratio of shared bicycles departing from η_2 and arriving at η_3; and, based on the number of training rounds Episode and the update interval Episode_upnet, executing steps S442-S445 in a loop;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S444: updating the shared bicycle operating environment when the policy execution state variable tr=1;
S445: updating each dispatch truck's next-time-step state s_{t+1}^i and average action \bar a_{t+1}^i, and updating the increased benefit R_t^i of the dispatch truck after implementing the scheduling policy according to the reward function;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm, and completing shared bicycle scheduling with the framework.
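The loop S441-S446 can be sketched as follows, under simplifying assumptions: `env` is a hypothetical simulator exposing the two-phase update (tr = 0 rider trips, then tr = 1 truck scheduling), `agents` expose act/learn/sync_targets interfaces, `pool` is the experience pool above, and `mean_action` is the helper from the earlier sketch; none of these interfaces are specified by the patent:

```python
def train(env, agents, pool, episodes, update_every, batch_size=64):
    """Skeleton of the training loop over steps S441-S446."""
    for ep in range(episodes):
        states = env.reset()                               # S441: initial supply, demand, ratios
        for t in range(env.t_max + 1):
            env.update_rider_trips()                       # S442: tr = 0, riding trips move bikes
            actions = [ag.act(s) for ag, s in zip(agents, states)]  # one-hot action vectors
            rewards, next_states = env.apply_schedules(actions)    # S444: tr = 1, trucks act
            a_bars = [mean_action(actions, i) for i in range(len(agents))]  # S445
            pool.push(states, actions, a_bars, rewards, next_states)
            states = next_states
            if len(pool) >= batch_size:
                for ag in agents:                          # S446: reinforcement learning update
                    ag.learn(pool.sample(batch_size))
        if (ep + 1) % update_every == 0:
            for ag in agents:                              # Episode_upnet target-network sync
                ag.sync_targets()
```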
Improving the stability of the multi-agent environment: during training in a multi-agent environment, the policy of agent i changes constantly, and so do the policies of the other agents, so agent i sits in a non-stationary environment. The transition probability from the state at one moment to the state at the next can then not be guaranteed to be a stable value; that is, for any state s, action a, and two moments t_1, t_2 with policies π_{t_1}^i ≠ π_{t_2}^i, a non-stationary environment admits

$$P\left(s'\mid s,a,\pi_{t_1}^1,\ldots,\pi_{t_1}^N\right)\neq P\left(s'\mid s,a,\pi_{t_2}^1,\ldots,\pi_{t_2}^N\right),$$

where π_{t_1}^i and π_{t_2}^i denote the policy of agent i at moments t_1 and t_2, and the two sides are the state transition probabilities of agent i at t_1 and t_2.

If agent i knows the action content of all agents during reinforcement learning, its environment becomes a stationary environment. Since a policy can be expressed through actions and states, when the states and action contents of all agents are known, the state transition probabilities of agent i at times t_1 and t_2 satisfy

$$P\left(s'\mid s,a,\pi_{t_1}^1,\ldots,\pi_{t_1}^N\right)=P\left(s'\mid s,a\right)=P\left(s'\mid s,a,\pi_{t_2}^1,\ldots,\pi_{t_2}^N\right).$$

Thus P(s′ | s, a) can be considered policy-independent: even while the agents' policies keep changing, the transition probability from one state to the next remains stationary, i.e., the equality above still holds. Therefore, for any given joint action, the environment of agent i is improved into a stationary environment as expressed by this formula.
In the embodiment of the invention, in step S442, the shared bicycle operating environment with tr = 0 is updated by updating and computing the shared bicycle path flow q_t^{(η_2,η_3)} of the trip OD label variable (η_2, η_3), the resulting actual trip production P_t^{η_5} of unit η_5, the actual attraction A_t^{η_5} of unit η_5, and the supply C_{t,tr=0}^{η_5} when the policy execution state variable tr = 0.

In step S444, the shared bicycle operating environment with tr = 1 is updated by updating and computing the behavior parameter a_t, the number b_t^i of shared bicycles the dispatch truck picks up at η_{i,0} and deposits on reaching η_{i,1}, and the supply C_{t,tr=1}^{η_5} when the policy execution state variable tr = 1.
In step S446, the reinforcement Learning algorithm adopts a strategy gradient method or a Q-Learning method;
Each dispatch truck comprises a strategy model and a value model. In the strategy gradient method, the strategy model of each dispatch truck comprises a strategy estimation network and a strategy target network: the strategy estimation network is a neural network with parameters θ_i that takes the truck state s_t^i as input and outputs the scheduling policy ρ_t^i; the strategy target network is a neural network with parameters θ_i′ that takes the next-time-step state s_{t+1}^i as input and outputs the next-time-step scheduling policy ρ_{t+1}^i;
In both the strategy gradient method and the Q-Learning method, the value model of each dispatch truck comprises a value estimation network and a value target network. The value estimation network is a neural network with parameters ω_i; it takes as input the truck state s_t^i, the scheduling policy ρ_t^i, and the average action \bar a_t^i, and outputs the Q-value function Q_i, where the Q-value function is the state-action value function of the reinforcement learning algorithm representing the cumulative reward attained by the dispatch truck. The value target network is a neural network with parameters ω_i′; it takes as input the next-time-step state s_{t+1}^i, the next-time-step scheduling policy ρ_{t+1}^i, and the next-time-step average action \bar a_{t+1}^i, and outputs the target Q-value function Q_i′. In both the estimation network and the target network of the value model, Q_i and Q_i′ are computed by forward propagation, the parameters are updated by backpropagation, and the strategy model is computed in the same way.
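A PyTorch-style sketch of the value estimation network, assuming a plain MLP over the concatenated (state, own action, mean action) input; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Value-estimation network mapping (state, own action, mean action)
    to a scalar Q value -- a sketch, not the patent's exact architecture."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * rho_dim, hidden),  # own + mean action
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                        # Q_i(s, a, a_bar)
        )

    def forward(self, state, action, mean_act):
        x = torch.cat([state, action, mean_act], dim=-1)
        return self.net(x)
```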
In the Q-Learning method, the strategy model of a dispatch truck samples an action from a Boltzmann distribution over Q values: the probability of choosing action a is proportional to exp(ω_d · Q_i(s_t^i, a, \bar a_{t-1}^i)), normalized over the action space A_i, where \bar a_{t-1}^i denotes the average action at time step t−1, i_ne denotes the label variable of a dispatch truck different from truck i, ω_d denotes the policy parameter, and Q_i is the value function expression. The sampled actions are used to update \bar a_t^i, which is substituted back into the formula; sampling again according to the updated probabilities yields a_t^i, which is taken as the final choice of the strategy model of the dispatch truck;
For the Q-Learning-based reinforcement learning algorithm, the strategy model of each agent obtains the action a_t^i from the Q_i values of agent i and contains no strategy target model. The value model of each agent is divided into a value estimation network and a value target network with the same structure as the estimation network. The value estimation network takes as input the global state s_t, the last-time-step action a_{t-1}^i, and the last-time-step average action \bar a_{t-1}^i, and outputs Q_i values. The value target network's input layer takes the next-moment global state s_{t+1}, the action value a_t^i, and the average action value \bar a_t^i, and outputs Q_i′ values. In both the estimation network and the target network of the value model, Q_i and Q_i′ are computed by forward propagation, and the parameters are updated by backpropagation.
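The Boltzmann sampling step can be sketched as follows, with ω_d acting as a temperature-like policy parameter (a reading of the description above, not a verbatim formula):

```python
import numpy as np

def boltzmann_select(q_values: np.ndarray, omega_d: float) -> int:
    """Sample an action index from a Boltzmann (softmax) distribution
    over the Q values of the candidate actions."""
    logits = omega_d * q_values
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```

In practice, the sampled actions of all trucks are used to recompute the average action, the Q values are re-evaluated, and sampling is repeated to obtain the final action, as described above.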
In the framework, each agent independently contains a strategy model for decentralized action execution and a value model containing centralized state-action information. The reinforcement learning algorithms used fall into two classes: strategy gradient methods and Q-Learning methods.
If the reinforcement learning algorithm adopts the strategy gradient method, the tuple (s_t, a_t, \bar a_t, r_t, s_{t+1}) is stored in the experience pool D, and a batch of samples is randomly drawn from D. For each sample j, with global state s_{t,j}, next-time-step global state s_{t+1,j}, policy ρ_{t,j}^i, average action \bar a_{t,j}^i, and dispatch truck reward r_{t,j}^i, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function, and the neural network parameters of the strategy model are updated by gradient descent. Every Episode_upnet training rounds, using the target network update weight ω, the strategy model parameters θ_i and the value model parameters ω_i are transferred to the corresponding strategy target network parameters θ_i′ and value target network parameters ω_i′, where s_t denotes the global state, s_{t+1} the global state of the next time step, r_t^i the reward value of the dispatch truck, and \bar a_t the average action.

If the reinforcement learning algorithm adopts the Q-Learning method, the tuple (s_t, a_t, \bar a_t, r_t, s_{t+1}) is stored in the experience pool D, a batch of samples is randomly drawn from D, the neural network parameters of the value estimation network are updated for each sample according to the cumulative return discount factor γ and the loss function, and every Episode_upnet training rounds the value model's neural network parameters are transferred to the value target network's parameters using the target network update weight ω.
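A sketch of the periodic target update with weight ω and of the one-step target used in the value loss; the soft-update form θ′ ← ωθ + (1−ω)θ′ is an assumption consistent with the weight coefficient ω described above:

```python
import torch

def soft_update(estimate_net, target_net, omega: float):
    """Target-network update with weight omega:
    theta_target <- omega * theta_estimate + (1 - omega) * theta_target."""
    with torch.no_grad():
        for p_t, p_e in zip(target_net.parameters(), estimate_net.parameters()):
            p_t.mul_(1.0 - omega).add_(omega * p_e)

def td_target(reward, q_next, gamma: float):
    """One-step target for the value loss: r + gamma * Q'(s', a', a_bar')."""
    return reward + gamma * q_next
```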
In summary, the embodiment of the invention provides, for the proposed scheduling optimization problem, a shared bicycle scheduling framework based on multi-agent reinforcement learning with mean field theory, aiming to let the agents learn the changing riding demand, adapt to a random dynamic environment, realize cooperative dynamic decision optimization, and increase riding trips. The basic ideas of the framework construction are as follows:
(1) Feasibility of solving the shared bicycle scheduling problem with a reinforcement learning framework
In the constructed shared bicycle scheduling optimization problem, only the supply state at the current moment needs to be known; no historical information from past times is required. That is, the supply state at the current moment depends only on the supply state at the previous moment and the policy action executed, and is independent of the supply states and decision actions at other moments. The supply state of the scheduling optimization problem therefore has the Markov property: the problem satisfies the Markov assumption that the current state contains all relevant information.
Thus, the shared bicycle scheduling problem can be converted into a Markov decision process, which can be solved within a reinforcement learning framework, so solving the scheduling optimization problem with a reinforcement learning framework is feasible. Reinforcement learning needs no labeled data and can realize model-free self-learning of the high-dimensional mapping from states to actions. In view of these advantages, the invention proposes a shared bicycle scheduling optimization framework based on a reinforcement learning algorithm.
(2) Problems in solving the shared bicycle scheduling problem with a reinforcement learning framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, issues such as degraded learning and repeated scheduling of the same area by several trucks can arise, as described below.
First, in a traditional reinforcement learning algorithm with multiple agents, each agent's policy is constantly changing; from the viewpoint of agent i, the policies of the other agents keep changing, so agent i is in a non-stationary environment. Non-stationarity violates the stability of Markov state transitions, causes policy estimation errors, and reduces policy optimization efficiency or makes policy optimization fail.
In the DQN algorithm, agent i learns and selects the best strategy through a separate Q-learning procedure. In a multi-agent environment, however, agent i updates its policy independently during learning without considering the other agents' policies, so the environment of any agent i is non-stationary: at different times t, the state transition probability is not necessarily a stable value. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to have a certain stationarity, and a non-stationary environment contradicts this assumption. Second, in the conventional DQN algorithm, agent i randomly samples data from the experience pool, and the sampled data form a training batch for the neural network, which avoids the problem of state correlation in the raw data. In a multi-agent environment, however, a policy that agent i optimized in the current state may be invalid in the next state of the non-stationary environment. Learning from samples containing invalid policies makes the single-agent DQN learning process inefficient and increases the likelihood that experience replay fails. Therefore, in a multi-agent environment, the conventional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a strategy gradient algorithm operating in a non-stationary environment, as the number of agents increases, the variance produced by the algorithm also increases, and the probability that the policy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected both by the environment and by the other agents. During learning, the dimension of the joint action in the state-action value function expands exponentially as the number of agents increases. Each agent estimates its own value function from the joint strategy, and when the joint action space is large, learning efficiency and learning effect both degrade.
Third, if the estimation network of the reinforcement learning algorithm has low sensitivity to the input variable data, an agent affected by a particularly high reward or penalty value will overweight the parameters corresponding to that state and action. This further reduces the agent's sensitivity to changes in state and action information and causes it to re-select the same action. For a single agent, overly uniform action selection increases the monotonicity of the learning sample data and harms the fitting accuracy of the estimation network. Inaccurate estimation of the dispatch truck strategy by the reinforcement learning algorithm then reduces scheduling efficiency.
Fourth, in a multi-agent reinforcement learning algorithm, the joint action can make several dispatch trucks handle shared bicycles in the same cell. Such non-cooperative scheduling strategies reduce scheduling efficiency and can even leave certain units over-scheduled, with no bicycles available to borrow or with shared bicycles piled up excessively.
(3) Basic ideas of the framework construction

Given these problems, a shared bicycle scheduling optimization framework based on multi-agent deep reinforcement learning must provide a stationary environment, handle the dimension explosion caused by the number of agents, and achieve efficient cooperation among the multiple agents. The ideas behind the framework are as follows:
First, the agent learning structure of the framework:

In multi-agent deep reinforcement learning methods with a distributed structure, reinforcement learning can be classified into group reinforcement learning and independent reinforcement learning, according to whether an agent considers the state and behavior information of the other agents.

If each agent in the multi-agent system can be regarded as an independent single agent without communication capability, i.e., it does not consider the policy selection of the other agents during its own policy selection, the algorithm is of the independent learning type. In this case, agents can obtain shared information only through collective communication after feedback from the external environment. Conversely, in group learning algorithms, the multiple agents are treated as a combined group, and each agent also considers the policy choices of the other agents during learning.

Independent learning avoids the dimension explosion in communication caused by a growing number of agents and can draw on reinforcement learning algorithms for static environments, but it converges slowly and learns for a long time. Group learning lets the agents communicate fully and cooperate completely, but its search space is large and learning takes long. To realize cooperation and communication among the agents, the strategies of other agents are considered during learning, and a group reinforcement learning method is adopted to construct the multi-agent deep reinforcement learning algorithm.
Second, improving the non-stationary environment in the framework:

For the problem of improving the non-stationary multi-agent environment: if agent i learns the action content of all agents during learning, its environment becomes a stationary environment. The states and action contents of the agents are therefore set as known information herein to improve the non-stationary environment.
Third, alleviating the dimension explosion caused by the increasing number of agents in the framework:

Mean field game theory (MFT) studies differential games among groups of rational players. While each agent considers its own state, the states of the other agents are still taken into account. A classic illustration of a mean field game is a school of fish swimming in a coordinated manner: a fish does not attend to the motion of every individual fish in the school, but adjusts its own behavior to the aggregate behavior of the school in its neighborhood. Mean field game theory can describe the behavioral response of surrounding agents and the behavior set of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The Mean Field Multi-Agent Reinforcement Learning (MFMARL) algorithm assumes that the influence of all other agents on a given agent can be represented by an average distribution. MFMARL suits large-scale multi-agent reinforcement learning problems and can simplify the interactive computation among agents; it solves the expansion of the value-function space caused by the growth in the number of agents. Therefore, MFMARL is incorporated herein into the shared bicycle scheduling framework, and each agent is defined to have the same discrete action space.
Fourth, improving the framework's sensitivity to changes in state and action information:

To improve learning stability in the multi-agent deep reinforcement learning framework and increase the neural network's sensitivity to changes in state and action information, the framework uses one-hot encoding as the input of the neural network, and the reward function value is scaled and then processed by the hyperbolic tangent tanh(·) function.
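A one-line sketch of this reward shaping, with α_rw as the scaling coefficient:

```python
import math

def shaped_reward(raw_reward: float, alpha_rw: float) -> float:
    """Scale the raw reward and squash it with tanh, keeping the
    learning signal bounded in (-1, 1) as described above."""
    return math.tanh(alpha_rw * raw_reward)
```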
Fifth, improving efficient cooperation among the agents in the framework:

To improve efficient cooperation among the agents, the framework makes the action content of all agents known during the learning of agent i, i.e., the states and strategies of the other agents can be learned. Furthermore, different forms of the reward function in the framework are discussed herein, and their impact on cooperative capability is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a shared bicycle scheduling framework of multi-agent reinforcement learning based on mean field theory for multi-agent group learning.
The working principle and process of the invention are as follows: the invention establishes a general framework for shared bicycle scheduling based on multi-agent deep reinforcement learning with mean field theory, in order to solve scheduling problems with long scheduling horizons, dynamic environments, and large-scale networks. The stationarity of state transitions in the multi-agent deep reinforcement learning algorithm, dimension explosion, agent communication efficiency, and agent exploration behavior are all considered. A reinforcement learning framework is adopted to obtain a coordinated and effective dynamic strategy in a high-dimensional action space, so that travel demand is met and idle shared bicycles on the road are reduced. Combining basic reinforcement learning theory with research on shared bicycle scheduling systems, the division into area units is defined and a shared bicycle scheduling optimization model is constructed.
For the high-dimensional multi-agent action space, a shared bicycle scheduling framework of multi-agent deep reinforcement learning based on mean field theory is proposed. The framework can address long-term scheduling, dynamic environments, and large-scale, complex networks. It needs no advance demand prediction or data preprocessing and is unaffected by the computational efficiency and accuracy of demand prediction. Moreover, the framework does not seek the best strategy for each period in isolation; it optimizes the entire scheduling process overall, taking into account supply and demand changes in future periods and the impact of scheduling decisions on the supply and demand of the next period.
The beneficial effects of the invention are as follows:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method helps to solve, intelligently, the short- and long-term scheduling optimization of shared bicycles on large-scale road networks under random and complex dynamic environments. The method needs no advance demand prediction or manual data processing and is unaffected by the computational efficiency and accuracy of demand prediction. It is not the optimal strategy for each individual period but an overall optimization of the entire scheduling process, accounting for future supply and demand changes and the influence of scheduling decisions on the next period's supply and demand.
(2) The proposed dynamic optimization scheduling strategy improves scheduling efficiency: it increases the trip volume and utilization of shared bicycles, reduces the lost demand of shared bicycle users, lowers the shared bicycle idle rate on the road, reduces the number of idle bicycles piled up excessively in certain areas, cuts the waste of shared resources, and alleviates the urban-environment problems caused by large piles of idle bicycles.
(3) Increasing the actual trips of shared bicycle users can raise the bicycle share of feeder traffic and improve the operating efficiency of the public transport system. Better shared bicycle service quality encourages replacing motor vehicle trips with bicycles, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.
Claims (6)
1. A shared bicycle scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
S1: dividing the dispatch area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;

S2: determining the scheduling variables of the shared bicycles from the operating environment variables, based on the scheduling area units;

S3: constructing a vehicle scheduling optimization model of the shared bicycles from the scheduling variables;

S4: based on the vehicle scheduling optimization model, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the framework;
In step S1, the dispatch area of the shared bicycles is divided into several identical regular hexagons serving as scheduling area units, and each unit is given a global label variable η_5, a horizontal label variable m, and a vertical label variable h, which satisfy the relation below:

where η_5 ∈ M′, M′ = {0, 1, ..., ((M+1)² − 1)}, M denotes the maximum value of the horizontal or vertical label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units;
In the step S1, the running environment variables of the shared bicycle comprise a time variable and a city fixed warehouse position set variable;
The time variables comprise the time step variable t, the time step set T, and the maximum time step variable T_max, where t ∈ T, T = {0, 1, ..., T_max};

The city fixed warehouse location set variable comprises the fixed warehouse location set η_w;
In the step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling policy variable class;
The policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0, 1};
At time step t, the supply and demand environment variable class comprises the shared bicycle travel demand O_t^{η_5} of a scheduling area unit, the unit's shared bicycle supply C_{t,tr=0}^{η_5} when the policy execution state variable tr = 0, and the unit's shared bicycle supply C_{t,tr=1}^{η_5} when the policy execution state variable tr = 1;
At time step t, the riding trip variable class comprises the global label η_2 of the scheduling area unit containing the OD origin of a shared bicycle trip, the global label η_3 of the scheduling area unit containing the OD destination, the trip OD label variable (η_2, η_3), the OD flow q_t^{(η_2,η_3)} of shared bicycle trips, the trip flow ratio λ_t^{(η_2,η_3)} of shared bicycles departing from η_2 and arriving at η_3, the actual trip production P_t^{η_5} of unit η_5, and the actual attraction A_t^{η_5} of unit η_5;
At time step t, the scheduling strategy variable class comprises the dispatch truck label set I, the dispatch truck label variable i, the truck's starting unit label variable η_{i,0}, the truck's arrival unit label variable η_{i,1}, the movement direction variable set κ_1, the scheduling ratio variable set κ_2, the truck's movement direction variable d_t^i from η_{i,0} to one of the six adjacent regular hexagons, the truck's scheduling ratio variable c_t^i, the truck's scheduling policy ρ_t^i, the maximum cabin capacity V^max, the number b_t^i of shared bicycles the truck picks up at η_{i,0} and deposits at η_{i,1}, the ratio α_wh of the shared bicycles deposited at η_{i,1} to the number in the cabin when the truck arrives at η_{i,1} and η_{i,1} belongs to η_w, the estimated cumulative change of the supply of unit η_5 under the expectation that the preceding trucks have already applied their scheduling policies before truck i applies its own, the increased benefit R_t^i after the truck implements the scheduling policy, and the total number Z_warehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling period;

where I = {0, 1, ..., N}, N denotes the maximum value of the dispatch truck label variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75};
In step S3, the vehicle scheduling optimization model of the shared bicycles maximizes the benefit gained after the dispatch trucks implement their scheduling strategies; the objective function of the short-term scheduling optimization problem is

$$\max\;\sum_{t=0}^{T_{max}}\sum_{i=0}^{N} R_t^i\!\left(\rho_t^i\right),$$

s.t. the constraints below, where t denotes a time step, T_max denotes the maximum time step variable, i denotes the dispatch truck label variable, N denotes the maximum value of the dispatch truck label variable, and ρ_t^i denotes the scheduling strategy of a dispatch truck;
The action decision when the policy execution state variable tr = 0 at time step variable t is ρ_t^i = (d_t^i, c_t^i), where d_t^i denotes the truck's movement direction variable from η_{i,0} to one of the six adjacent regular hexagons and c_t^i denotes the truck's scheduling ratio variable;

When tr = 0 at time step variable t and the global label variable η_5 of a scheduling area unit coincides with the global label η_2 of the unit containing the OD origin of shared bicycle trips, the shared bicycle path flow of the trip OD label variable (η_2, η_3) is computed as

$$q_t^{(\eta_2,\eta_3)}=\mathrm{INT}\!\left(\lambda_t^{(\eta_2,\eta_3)}\cdot \min\!\left(O_t^{\eta_2},\,C_t^{\eta_2}\right)\right),$$

where INT(·) denotes rounding down to an integer, O_t^{η_2} denotes the unit's shared bicycle travel demand, C_t^{η_2} denotes the unit's shared bicycle supply (the initially given supply when t = 0, and the supply with tr = 1 from the previous step otherwise), λ_t^{(η_2,η_3)} denotes the trip flow ratio of shared bicycles departing from η_2 and arriving at η_3, and M′ denotes the unit label set of the scheduling area units;
The trip flow ratios taking the unit η_2 containing the OD origin as the starting point sum to 1:

$$\sum_{\eta_3\in M'}\lambda_t^{(\eta_2,\eta_3)}=1,\quad\forall t\in T,$$

where T denotes the time step set and η_3 the global label of the unit containing the OD destination of the shared bicycle trip;

According to the path flow, when tr = 0 at time step variable t and the global label variable η_5 of a scheduling area unit equals the origin label η_2, the sum of the shared bicycle path flows is taken as the actual trip production of unit η_5:

$$P_t^{\eta_5}=\sum_{\eta_3\in M'} q_t^{(\eta_5,\eta_3)};$$

When tr = 0 at time step variable t and η_5 equals the destination label η_3, the sum of the shared bicycle path flows is taken as the actual attraction of unit η_5:

$$A_t^{\eta_5}=\sum_{\eta_2\in M'} q_t^{(\eta_2,\eta_5)};$$

When tr = 0 at time step variable t, the shared bicycle supply is updated according to the numbers of bicycles rented and parked during riders' travel activity:

$$C_{t,tr=0}^{\eta_5}=C_{t-1,tr=1}^{\eta_5}-P_t^{\eta_5}+A_t^{\eta_5},$$

where C_{t−1,tr=1}^{η_5} denotes the supply with tr = 1 after the scheduling policy was applied at time step (t−1), and P_t^{η_5} and A_t^{η_5} denote the unit's actual trip production and actual attraction at time step t;
When tr = 0 at time step variable t, the unit label variable that a dispatch truck will reach at time step (t+1) is computed from the horizontal label variable m, the vertical label variable h, the truck's starting unit label variable at time step (t+1), and the truck's movement direction variable from η_{i,0} to one of the six adjacent regular hexagons;

When tr = 0 at time step variable t, the estimated cumulative change of the supply of unit η_5 is obtained by accumulating, over the preceding dispatch trucks up to truck (i−1), the numbers of shared bicycles they are expected to pick up from η_5 and to deposit there, where α_wh denotes the ratio of bicycles deposited at η_{i,1} to the number in the cabin when the truck arrives at η_{i,1} and η_{i,1} belongs to η_w, and η_w denotes the fixed warehouse location set;
When tr = 0 at time step t, the dispatch truck picks up b_t^i shared bicycles from η_{i,0} into its cabin and deposits all b_t^i of them at η_{i,1}; the number picked up is

$$b_t^i=\min\!\left(\mathrm{INT}\!\left(c_t^i\cdot V^{max}\right),\;C_{t,tr=0}^{\eta_{i,0}}\right),$$

where min(·) takes the minimum, C_{t,tr=0}^{η_{i,0}} denotes the supply when the policy execution state variable tr = 0, η_{i,0} denotes the truck's starting unit label variable, V^max denotes the maximum cabin capacity of the dispatch truck, and c_t^i denotes the truck's scheduling ratio variable;
When tr = 1 at time step t, the scheduling policy is executed according to the numbers b_t^i of bicycles picked up by the dispatch trucks, and the supply of η_5 is updated to the post-policy supply C_{t,tr=1}^{η_5} by subtracting the bicycles picked up from η_5 and adding the bicycles deposited there;

The total number Z_warehouse of shared bicycles stored in the urban fixed warehouses is computed by accumulating, over the scheduling period, the bicycles deposited by dispatch trucks into units belonging to η_w;
Said step S4 comprises the sub-steps of:
S41: determining elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycles;

S42: determining average actions using one-hot encoding;

S43: defining the experience pool variables and training round variables of the shared bicycle scheduling framework;

S44: based on mean field theory, constructing the shared bicycle scheduling framework from its elements, the average actions, the experience pool variables, and the training round variables.
2. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S41 the elements of the shared bicycle scheduling framework include the state s_t^i, the behavior parameter a_t, and the reward function, where s_t^i denotes the state of dispatch truck i at time step variable t and ρ_t^i denotes the scheduling strategy of the truck at time step variable t;

The reward function takes one of three forms, namely the actually increased trip reward r_t^{i,PA}, the average increased trip reward r_t^{i,APA}, and the globally increased trip reward r_t^{i,APTU}, given by

$$r_t^{i,PA}=\alpha_{rw}\sum_{\eta_5\in\{\eta_{i,0},\eta_{i,1}\}}\left(\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}\right),\quad r_t^{i,APA}=\alpha_{rw}\sum_{\eta_5\in\{\eta_{i,0},\eta_{i,1}\}}\frac{\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}}{n_t^{\eta_5}},\quad r_t^{i,APTU}=\frac{\alpha_{rw}}{N+1}\sum_{\eta_5\in M'}\left(\hat P_t^{\eta_5}-\tilde P_t^{\eta_5}\right),$$

where α_rw denotes the scaling coefficient of the reward function; \hat P_t^{η_5} and \tilde P_t^{η_5} denote the actual trip productions of unit η_5 with and without the scheduling policy; n_t^{η_{i,0}} and n_t^{η_{i,1}} denote the numbers of dispatch trucks within η_{i,0} and η_{i,1} at time step variable t; N denotes the maximum value of the dispatch truck label variable; η_5 denotes the global label of each scheduling region unit; and M′ denotes the unit label set of the scheduling region units.
3. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S42 the average action is determined by rewriting the scheduling policy ρ_t^i of a dispatch truck in one-hot encoding and computing the average action \bar a_t^i = (1/N)·Σ_{i_ne ≠ i} a_t^{i_ne} over the other trucks' one-hot action vectors.
4. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S43 the experience pool variables of the shared bicycle scheduling framework comprise the experience pool D and the experience pool capacity |D|;

The training round variables comprise the number of training rounds Episode, the target network update interval Episode_upnet, the weight coefficient ω of the target network update, and the cumulative return discount factor γ.
5. The shared bicycle scheduling method based on deep reinforcement learning as claimed in claim 1, wherein the step S44 includes the sub-steps of:
S441: initializing the experience pool D; setting the experience pool capacity |D|, the target network update weight ω, the reward function scaling coefficient α_rw, and the cumulative return discount factor γ; giving the initial supply at t = 0, the shared bicycle travel demand variable, and the trip flow ratio of shared bicycles departing from η_2 and arriving at η_3; and, based on the number of training rounds Episode and the update interval Episode_upnet, executing steps S442-S445 in a loop;

S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;

S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;

S445: updating each dispatch truck's next-time-step state s_{t+1}^i and average action \bar a_{t+1}^i, and updating the increased benefit R_t^i of the dispatch truck after implementing the scheduling policy according to the reward function;

S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm, and completing shared bicycle scheduling with the framework.
6. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S442 the shared bicycle operating environment with tr = 0 is updated by updating and computing the shared bicycle path flow q_t^{(η_2,η_3)} of the trip OD label variable (η_2, η_3), the resulting actual trip production P_t^{η_5} of unit η_5, the actual attraction A_t^{η_5} of unit η_5, and the supply C_{t,tr=0}^{η_5} when the policy execution state variable tr = 0;

In step S444, the shared bicycle operating environment with tr = 1 is updated by updating and computing the behavior parameter a_t, the number b_t^i of shared bicycles the dispatch truck picks up at η_{i,0} and deposits on reaching η_{i,1}, and the supply C_{t,tr=1}^{η_5} when the policy execution state variable tr = 1;
In step S446, the reinforcement Learning algorithm adopts a strategy gradient method or a Q-Learning method;
Each dispatch truck comprises a strategy model and a value model. In the strategy gradient method, the strategy model of each dispatch truck comprises a strategy estimation network and a strategy target network: the strategy estimation network is a neural network with parameters θ_i that takes the truck state s_t^i as input and outputs the scheduling policy ρ_t^i; the strategy target network is a neural network with parameters θ_i′ that takes the next-time-step state s_{t+1}^i as input and outputs the next-time-step scheduling policy ρ_{t+1}^i;

In both the strategy gradient method and the Q-Learning method, the value model of each dispatch truck comprises a value estimation network and a value target network: the value estimation network is a neural network with parameters ω_i that takes as input the truck state s_t^i, the scheduling policy ρ_t^i, and the average action \bar a_t^i, and outputs the Q-value function Q_i, where the Q-value function is the state-action value function of the reinforcement learning algorithm representing the cumulative reward attained by the dispatch truck; the value target network is a neural network with parameters ω_i′ that takes as input the next-time-step state s_{t+1}^i, the next-time-step scheduling policy ρ_{t+1}^i, and the next-time-step average action \bar a_{t+1}^i, and outputs the target Q-value function Q_i′;

In the Q-Learning method, the strategy model of a dispatch truck samples an action from a Boltzmann distribution over Q values: the probability of choosing action a is proportional to exp(ω_d · Q_i(s_t^i, a, \bar a_{t-1}^i)), normalized over the action space A_i, where \bar a_{t-1}^i denotes the average action at time step t−1, i_ne denotes the label variable of a dispatch truck different from truck i, ω_d denotes the policy parameter, and Q_i is the value function expression; the sampled actions update \bar a_t^i, which is substituted back into the formula, and sampling again according to the updated probabilities yields the action a_t^i taken as the final choice of the truck's strategy model;
If the reinforcement learning algorithm adopts the strategy gradient method, the tuple (s_t, a_t, \bar a_t, r_t, s_{t+1}) is stored in the experience pool D, and a batch of samples is randomly drawn from D; for each sample, the neural network parameters ω_i of the value estimation network are updated according to the cumulative return discount factor γ and the loss function, and the strategy model's neural network parameters θ_i are updated by gradient descent; every Episode_upnet training rounds, using the target network update weight ω, the strategy model parameters θ_i and the value model parameters ω_i are transferred to the corresponding strategy target network parameters θ_i′ and value target network parameters ω_i′, where s_t denotes the global state, s_{t+1} the global state of the next time step, r_t^i the reward value of the dispatch truck, \bar a_t the average action, and the subscript j indexes the quantities of a sampled sample (global state s_{t,j}, next-time-step global state s_{t+1,j}, policy, average action, and truck reward);

If the reinforcement learning algorithm adopts the Q-Learning method, the tuple (s_t, a_t, \bar a_t, r_t, s_{t+1}) is stored in the experience pool D, and a batch of samples is randomly drawn from D; for each sample, the neural network parameters ω_i of the value estimation network are updated according to the cumulative return discount factor γ and the loss function; and every Episode_upnet training rounds, the value model's neural network parameters ω_i are transferred to the value target network's parameters ω_i′ using the target network update weight ω.
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110421814 | 2021-04-20 | | |
| CN2021104218142 | 2021-04-20 | | |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113326993A | 2021-08-31 |
| CN113326993B | 2023-06-09 |
Family

ID=77425362

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110744265.2A (Active) | Shared bicycle scheduling method based on deep reinforcement learning | 2021-04-20 | 2021-06-30 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN113326993B (en) |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN113326993A | 2021-08-31 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |