CN113326993A - Shared bicycle scheduling method based on deep reinforcement learning - Google Patents
Shared bicycle scheduling method based on deep reinforcement learning
- Publication number
- CN113326993A (application number CN202110744265.2A)
- Authority
- CN
- China
- Prior art keywords
- variable
- dispatching
- scheduling
- vehicle
- shared bicycle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/15—Vehicle, aircraft or watercraft design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06315—Needs-based resource requirements planning or analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/08—Probabilistic or stochastic CAD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/12—Timing analysis or timing optimisation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a shared bicycle dispatching method based on deep reinforcement learning, which comprises the following steps: S1: dividing the dispatching area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles; S2: determining the scheduling variables of the shared bicycles; S3: constructing a vehicle dispatching optimization model of the shared bicycles; S4: constructing a shared bicycle dispatching framework using mean field theory, based on the vehicle dispatching optimization model, and completing shared bicycle dispatching with this framework. The reinforcement-learning-based shared bicycle dispatching optimization method provided by the invention helps to intelligently solve the short-term and long-term dispatching optimization problem of shared bicycles on a large-scale road network under random and complex dynamic environments. The method accounts for the supply and demand changes of the environment over future time and the interaction between scheduling decisions and the environment, requires neither advance demand prediction nor manual data processing, and is not limited by the computational efficiency and accuracy of demand prediction.
Description
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In past research, the bicycle scheduling optimization problem is usually solved by dividing the scheduling horizon into different time periods and, on that basis, independently searching for the optimal scheduling strategy within each period. However, the scheduling policy of one time period affects the supply and demand environment of the next and all subsequent periods. A time-period-based isolated policy optimization method does not take into account the supply and demand conditions of future periods or the resulting impact of the implemented policy. As a result, the strategy that is optimal for the current period does not necessarily promote higher actual ridership in future periods, and may even reduce it. Therefore, a time-period-based isolated policy optimization method does not necessarily yield a globally optimal policy over the full scheduling horizon.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem under long-term scheduling processes, dynamic environments, and large-scale networks, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows: a shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
s1: dividing a dispatching area of the shared bicycles to obtain dispatching area units, and determining running environment variables of the shared bicycles;
s2: determining a dispatching variable of the shared bicycle according to the running environment variable of the shared bicycle based on the dispatching area unit;
s3: constructing a vehicle dispatching optimization model of the shared bicycles according to the scheduling variables of the shared bicycles;
s4: constructing a shared bicycle dispatching framework using mean field theory, based on the vehicle dispatching optimization model of the shared bicycles, and completing shared bicycle dispatching with this framework.
Further, in step S1, the specific method for dividing the dispatching area of the shared bicycles is: dividing the dispatching area into a number of equilateral hexagons serving as scheduling area units, and defining for each scheduling area unit a global label variable η_5, a horizontal direction label variable m, and a vertical direction label variable h, which satisfy the following relational expression:
where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² − 1}, M denotes the maximum value of the horizontal or vertical direction label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units;
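The explicit relational expression between (m, h) and η_5 is not reproduced in this text; the sketch below assumes a simple row-major convention consistent with η_5 ∈ {0, ..., (M+1)² − 1}, so the function names and the exact formula are illustrative assumptions rather than the patent's own expression.

```python
def to_global_label(m: int, h: int, M: int) -> int:
    """Map the horizontal label m and vertical label h of a hexagonal
    scheduling area unit to a single global label eta_5.

    Assumes a row-major layout over an (M+1) x (M+1) grid; the patent's
    exact relational expression may differ."""
    assert 0 <= m <= M and 0 <= h <= M
    return h * (M + 1) + m


def from_global_label(eta5: int, M: int) -> tuple[int, int]:
    """Inverse mapping: recover (m, h) from the global label eta_5."""
    return eta5 % (M + 1), eta5 // (M + 1)
```

Under this convention the labels of all units form exactly the set M′ = {0, 1, ..., (M+1)² − 1}.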
in step S1, the operating environment variables of the shared bicycles include time variables and a city fixed warehouse location set variable;
the time variables comprise a time step variable t, a time step variable set T, and a maximum time-step variable T_max, where t ∈ T and T = {0, 1, ..., T_max};
the city fixed warehouse location set variable comprises the fixed warehouse location set η_w.
Further, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class, and a scheduling strategy variable class;
the policy execution state variable class comprises a policy execution state variable tr, where tr ∈ {0, 1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of the scheduling area unit when tr = 1;
at time step t, the riding trip variable class comprises the global label η_2 of the scheduling area unit containing the OD origin of a shared bicycle trip, the global label η_3 of the scheduling area unit containing the OD destination of the trip, the OD label variable (η_2, η_3) of the trip, the OD flow of shared bicycle trips, the travel flow of shared bicycles departing from η_2 and arriving at η_3, the actual travel generation volume of unit η_5, and the actual travel attraction volume of unit η_5;
at time step t, the scheduling strategy variable class comprises the dispatch vehicle label set I, the dispatch vehicle label variable i, the label variable of the unit a dispatch vehicle departs from, the label variable of the unit it arrives at, the set κ_1 of moving direction variables, the set κ_2 of scheduling ratio variables, the moving direction variable of a dispatch vehicle toward its six adjacent regular hexagons, the scheduling ratio variable of a dispatch vehicle, the dispatching strategy of a dispatch vehicle, the maximum cabin capacity of a dispatch vehicle, the variable for the number of shared bicycles a dispatch vehicle picks up and places, the ratio α_wh of the number of shared bicycles a dispatch vehicle places at a unit belonging to the fixed warehouse set η_w to the number of vehicles in its cabin, the predicted cumulative increase or decrease of the supply of unit η_5 under the expectation that preceding dispatch vehicles have implemented their scheduling strategies, the increased revenue after a dispatch vehicle implements its scheduling strategy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
where I = {0, 1, ..., N}, N denotes the maximum value of the dispatch vehicle label variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}.
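The direction set κ_1 and ratio set κ_2 above fully determine a compact per-vehicle action space. A minimal sketch, assuming (as the text suggests but does not state explicitly) that one joint action pairs a moving direction with a scheduling ratio:

```python
from itertools import product

KAPPA_1 = list(range(6))             # moving directions toward the six adjacent hexagons
KAPPA_2 = [0.0, 0.25, 0.5, 0.75]     # scheduling ratio variables

# Illustrative joint action of one dispatch vehicle: (direction, ratio).
ACTION_SPACE = list(product(KAPPA_1, KAPPA_2))
```

With 6 directions and 4 ratios, each dispatch vehicle chooses among 24 discrete actions, which keeps the multi-agent action space tractable.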
Further, step S4 includes the following sub-steps:
s41: determining the elements of the shared bicycle dispatching framework based on the vehicle dispatching optimization model of the shared bicycles;
s42: determining the average action using one-hot coding;
s43: defining the experience pool variables and training-round related variables of the shared bicycle dispatching framework;
s44: constructing the shared bicycle dispatching framework from its elements, the average action, the experience pool variables, and the training-round related variables, based on mean field theory.
Further, in step S41, the elements of the shared bicycle dispatching framework include a state, an action parameter a_t, and a reward function, where the state (for i = 0, 1, ..., N) denotes the state of dispatch vehicle i at time step variable t, and the action parameter denotes the scheduling strategy of dispatch vehicle i at time step variable t;
the reward function comprises a reward function for the actual increased travel volume of a dispatch vehicle, an average increased travel volume reward function over the dispatch vehicles, and an overall increased travel volume function of the dispatch vehicles;
where α_rw denotes the scaling factor of the reward function; the increased travel volume of a unit is the difference between its actual travel volume when the scheduling strategy is implemented and its actual travel volume when no scheduling strategy is implemented; N denotes the maximum value of the dispatch vehicle label variable; η_5 denotes the global label of each scheduling area unit; and M′ denotes the unit label set of the scheduling area units.
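The explicit reward formulas correspond to expressions not reproduced in this text. A hedged sketch of the structure the definitions above imply, with function and parameter names invented for illustration, is:

```python
def actual_increase_reward(trips_with_policy: float,
                           trips_without_policy: float,
                           alpha_rw: float) -> float:
    """Reward of one dispatch vehicle: the extra travel volume attributable
    to its scheduling strategy, scaled by alpha_rw. A reconstruction of the
    structure described in the text, not the patent's exact formula."""
    return alpha_rw * (trips_with_policy - trips_without_policy)


def average_increase_reward(vehicle_rewards: list[float]) -> float:
    """Average increased-travel-volume reward over the dispatch vehicles
    active in the current time step."""
    return sum(vehicle_rewards) / len(vehicle_rewards)
```

The overall increased travel volume function would analogously sum the increases over all units η_5 ∈ M′.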
Further, in step S42, the specific method for determining the average action is: rewriting the scheduling strategy of each dispatch vehicle in one-hot coded form and averaging to obtain the average action;
where each component of the one-hot coding is a variable taking the value 0 or 1, p_dim denotes the dimension of the scheduling strategy, N denotes the maximum value of the dispatch vehicle label variable, and i_ne denotes the label variables of the dispatch vehicles other than dispatch vehicle i, with corresponding action strategies.
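The one-hot averaging above can be sketched as follows; the function names are illustrative, but the computation (mean of the one-hot encodings of the other vehicles' actions) is the standard mean-field construction the text describes:

```python
def one_hot(action_index: int, p_dim: int) -> list[float]:
    """One-hot encode a scheduling-strategy index of dimension p_dim."""
    v = [0.0] * p_dim
    v[action_index] = 1.0
    return v


def mean_action(actions: list[int], i: int, p_dim: int) -> list[float]:
    """Average action seen by dispatch vehicle i: the component-wise mean
    of the one-hot encodings of all OTHER dispatch vehicles' actions."""
    others = [one_hot(a, p_dim) for j, a in enumerate(actions) if j != i]
    n = len(others)
    return [sum(col) / n for col in zip(*others)]
```

The result is a probability vector over the p_dim discrete strategies, which is what the value networks below consume as the average action input.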
Further, in step S43, the experience pool variables of the shared bicycle dispatching framework include an experience pool and an experience pool capacity;
the training-round related variables comprise the number of training rounds Episode, the target-network update interval Episode_upnet (in training rounds), the target network update weight coefficient ω, and the cumulative reward discount factor γ.
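A minimal sketch of the experience pool (a fixed-capacity replay buffer that later steps sample mini-batches from); the class and method names are illustrative:

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-capacity experience pool; oldest transitions are evicted
    automatically once the capacity is reached."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        """Store one transition (state, action, average action, reward, next state)."""
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        """Randomly sample a mini-batch (without replacement)."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```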
Further, step S44 includes the following sub-steps:
s441: initializing the experience pool; setting the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient α_rw, the cumulative reward discount factor γ, the initially given supply amounts, the shared bicycle travel demand variables, and the travel flow of shared bicycles departing from η_2 and arriving at η_3; and cyclically performing steps S442 to S445 according to the number of training rounds;
s442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
s444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
s445: updating, for each dispatch vehicle, the state and the average action of the next time step, and updating, according to the reward function, the increased revenue of the dispatch vehicle after it implements its scheduling strategy;
s446: based on the updating process of steps S442 to S445, constructing the shared bicycle dispatching framework with a reinforcement learning algorithm, and completing shared bicycle dispatching with this framework.
Further, in step S442, the specific method of updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: updating and calculating the OD label variables (η_2, η_3) of shared bicycle trips, the shared bicycle travel flows, the actual travel generation volume of each unit η_5, the actual travel attraction volume of each unit η_5, and the supply amounts when tr = 0.
In step S444, the specific method of updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: updating and calculating the action parameter a_t, the variable for the number of shared bicycles a dispatch vehicle picks up at its departure unit η_{i,0} and places upon reaching its arrival unit η_{i,1}, and the supply amounts when tr = 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.
Each dispatch vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatch vehicle comprises a policy estimation network and a policy target network: the policy estimation network is a neural network whose input is the state of the dispatch vehicle and whose output is its scheduling strategy; the policy target network is a neural network whose input is the state of the dispatch vehicle at the next time step and whose output is the scheduling strategy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatch vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatch vehicle, its scheduling strategy, and the average action, and whose output is the Q-value function Q_i; here the Q-value function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatch vehicle. The value target network is a neural network whose inputs are the state, the scheduling strategy, and the average action of the next time step, and whose output is the target Q-value function.
In the Q-Learning method, the policy model of each dispatch vehicle selects its action by probability sampling: the t−1 time-step average action of the other dispatch vehicles i_ne and the Q-value function Q_i determine the action probabilities over the action space set A_i of dispatch vehicle i, where ω_d denotes the policy parameter; the action obtained by probability sampling is taken as the final action selected by the policy model of the dispatch vehicle.
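The probability-sampling selection described above is commonly realized as Boltzmann (softmax) exploration over the Q-values. A hedged sketch under that assumption (the temperature parameter stands in for the policy parameter ω_d; the patent's exact sampling formula is not reproduced here):

```python
import math
import random


def boltzmann_sample(q_values: list[float], temperature: float = 1.0) -> int:
    """Sample an action index with probability proportional to
    exp(Q / temperature). Subtracting the max Q first keeps the
    exponentials numerically stable."""
    mx = max(q_values)
    weights = [math.exp((q - mx) / temperature) for q in q_values]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(q_values) - 1
```

As the temperature decreases, sampling concentrates on the highest-valued action; a high temperature approaches uniform exploration.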
If the reinforcement learning algorithm adopts the policy gradient method, the transition (the global state s_t, the scheduling strategies and average actions of the dispatch vehicles, the reward values of the dispatch vehicles, and the global state s_{t+1} of the next time step) is stored in the experience pool, and a batch of samples is randomly drawn from the experience pool; according to the sampled global states s_{t,j} and s_{t+1,j}, the sampled strategies, the sampled average actions, the sampled reward values, the cumulative reward discount factor γ, and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters θ_i of the policy model are updated by gradient descent. Every Episode_upnet training rounds, according to the target network update weight coefficient ω, the policy model neural network parameters θ_i and the value model neural network parameters are transferred to the neural network parameters of the corresponding policy target network and value target network, respectively.
If the reinforcement learning algorithm adopts the Q-Learning method, the transition is likewise stored in the experience pool, and a batch of samples is randomly drawn from it; the neural network parameters of the value estimation network are updated according to the samples, the cumulative reward discount factor γ, and the loss function. Every Episode_upnet training rounds, the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the target network update weight coefficient ω.
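The periodic target-network transfer with weight coefficient ω can be sketched as a Polyak-style soft update. This is a hedged reading of how ω is applied (the patent does not give the explicit formula); parameters are shown as flat lists for simplicity:

```python
def soft_update(target_params: list[float],
                online_params: list[float],
                omega: float) -> list[float]:
    """Blend online parameters into the target network:
    target <- omega * online + (1 - omega) * target.
    With omega = 1 this degenerates to a hard copy."""
    return [omega * p + (1.0 - omega) * tp
            for p, tp in zip(online_params, target_params)]
```

Keeping ω small makes the target networks change slowly, which stabilizes the bootstrapped Q-value targets during training.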
The invention has the beneficial effects that:
(1) The reinforcement-learning-based shared bicycle dispatching optimization method provided by the invention helps to intelligently solve the short-term and long-term dispatching optimization problem of shared bicycles on a large-scale road network under random and complex dynamic environments. The method requires neither advance demand prediction nor manual data processing, and is not limited by the computational efficiency and accuracy of demand prediction. Moreover, rather than optimizing each time period in isolation, it is a global optimization method for the whole scheduling process that considers the supply and demand changes of future periods and the influence of scheduling decisions on the supply and demand of subsequent periods.
(2) The dynamic optimization scheduling strategy provided by the invention can improve scheduling operation efficiency. It increases the turnover and utilization rate of shared bicycles, reduces the loss of shared bicycle user demand, lowers the idle rate of shared bicycles on the road, and reduces the excessive accumulation of idle vehicles in certain areas. It thereby cuts the waste of shared resources and alleviates the degradation of the urban environment caused by the stacking of large numbers of idle vehicles.
(3) By increasing the actual travel volume of shared bicycle users, the method can raise the share of shared bicycles in feeder traffic and improve the operating efficiency of the public transport system. It improves the service quality of shared bicycles, encourages shared bicycle trips to replace motor vehicle trips, relieves urban congestion, reduces motor vehicle exhaust emissions, and increases social welfare.
Drawings
FIG. 1 is a flow chart of a shared bicycle scheduling method;
fig. 2 is a coordinate diagram of the scheduling area units based on equilateral hexagon division.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Before describing specific embodiments of the present invention, in order to make the solution of the present invention more clear and complete, the definitions of the abbreviations and key terms appearing in the present invention will be explained first:
OD travel volume: the travel volume between a pair of trip endpoints. "O" comes from the English word "origin" and refers to the starting point of a trip; "D" comes from "destination" and refers to the end point of a trip.
MFMARL algorithm: Mean Field Multi-Agent Reinforcement Learning, a multi-agent reinforcement learning algorithm based on mean field game theory.
As shown in fig. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
s1: dividing a dispatching area of the shared bicycles to obtain dispatching area units, and determining running environment variables of the shared bicycles;
s2: determining a dispatching variable of the shared bicycle according to the running environment variable of the shared bicycle based on the dispatching area unit;
s3: constructing a vehicle dispatching optimization model of the shared bicycles according to the scheduling variables of the shared bicycles;
s4: constructing a shared bicycle dispatching framework using mean field theory, based on the vehicle dispatching optimization model of the shared bicycles, and completing shared bicycle dispatching with this framework.
In the embodiment of the invention, for this sequential decision problem, the interaction between the supply and demand environment and the implemented scheduling strategy is considered, and a dynamic scheduling optimization problem for shared bicycles is formulated. According to the length of the scheduling optimization cycle and whether placing surplus shared bicycles into city fixed warehouses is considered, the scheduling optimization problem can be divided into two problems: the bicycle scheduling optimization problem without fixed warehouses and the bicycle scheduling optimization problem that considers fixed warehouses.
In the scheduling optimization problem of the shared bicycles, the optimization target is neither to maximize the actual travel volume within a single time period nor to pursue high scheduling efficiency for a single dispatch vehicle, but to maximize the global travel volume through cooperative dynamic scheduling strategy optimization over the whole scheduling cycle. Further, in achieving this objective, when city warehouses exist, the scheduling strategy includes placing redundant vehicles into the warehouses so as to reduce redundant bicycles on the road.
The present invention constructs a schedule optimization process for sharing bicycles as shown in fig. 3. In the dynamic scheduling optimization process, the bicycle renting, riding, parking and scheduling processes and the supply and demand change conditions are considered. At each time step, each dispatching vehicle picks up a certain number of shared bicycles from the current unit and loads the shared bicycles into the dispatching vehicle cabin, and then the dispatching vehicle drives to the arrival unit and places all the shared bicycles in the cabin in the arrival unit.
In the embodiment of the present invention, as shown in fig. 2, the specific method of partitioning the dispatching area of the shared bicycles is: dividing the dispatching area into a plurality of equilateral hexagons as dispatching area units, and defining for each dispatching area unit a global label variable η5, a horizontal direction label variable m and a vertical direction label variable h, the three labels satisfying a one-to-one relational expression,
wherein η5 ∈ M′, M′ = {0, 1, ..., (M+1)^2 - 1}, M denotes the maximum value of the horizontal direction label variable or the vertical direction label variable of a dispatching area unit, and M′ denotes the unit label set of the dispatching area units;
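To make the labeling concrete, the following sketch shows one consistent realization of the unit labeling. The patent's actual relational expression is not reproduced in the text, so the row-major mapping and the function names below are assumptions chosen only to be consistent with m, h ∈ {0, ..., M} and η5 ∈ {0, ..., (M+1)^2 - 1}.

```python
def global_label(m: int, h: int, M: int) -> int:
    """Assumed mapping eta5 = h * (M + 1) + m (row-major over the grid)."""
    assert 0 <= m <= M and 0 <= h <= M
    return h * (M + 1) + m

def mh_from_global(eta5: int, M: int) -> tuple:
    """Inverse of the assumed mapping: recover (m, h) from eta5."""
    assert 0 <= eta5 <= (M + 1) ** 2 - 1
    return eta5 % (M + 1), eta5 // (M + 1)
```

Any bijection between (m, h) pairs and {0, ..., (M+1)^2 - 1} would satisfy the stated sets; the row-major choice is merely one convention.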
in step S1, the operating environment variables of the shared bicycles include time variables, city fixed warehouse location set variables, and supply parameters;
the time variables comprise a time step variable t, a time step variable set T and a maximum time step variable Tmax, wherein t ∈ T, T = {0, 1, ..., Tmax};
The city fixed warehouse location set variable comprises a fixed warehouse location set ηw;
Some units in a city may be set as fixed warehouses of the dispatching city, and when dispatching measures are implemented, the dispatching vehicles may place idle shared bicycles into the city fixed warehouses of those units. The capacity of a city fixed warehouse has no upper limit, and bicycles placed in the warehouse by a dispatching vehicle will not be transported out again or given to riders for use. When the unit containing a rider's trip destination is a city fixed warehouse unit, the shared bicycle parked by the rider is not placed into the city fixed warehouse but remains in the unit and can still be used by riders in the future. The city fixed warehouse location set ηw contains the locations of the units in which the city fixed warehouses are located within the whole region.
The supply parameters comprise a first supply coefficient cdis and a second supply coefficient cinitial. The first supply coefficient cdis is determined as follows: the demand value of each dispatching area unit at each time step variable is calculated from the shared bicycle demand data, and the 40th percentile of the per-10-minute demand values over all dispatching area units is taken as the first supply coefficient cdis. The second supply coefficient cinitial is determined as follows: the ratio of the shared bicycle supply amount in each dispatching area unit to the first supply coefficient cdis is taken as the second supply coefficient cinitial.
It is assumed that at the initial time, the shared bicycle supply of each unit in the area is evenly distributed. In order to make the study of the influence of supply amount on travel generalizable, the invention does not directly specify the supply amount, but determines the supply value from the relation between supply and demand. The invention defines the first supply coefficient cdis as the 40th percentile of the sequence of per-10-minute riding demand values of all units, where the demand value of each unit at each time step is calculated from the demand data. The 40th percentile is chosen instead of the mean because the mean of the data is more susceptible to extreme values. The 40th percentile is the value located at the 40% position after all values are sorted from small to large. This avoids the problem that a small number of very high demands in the riding demand sequences of the units would make the analysis results lose generality.
The invention defines the second supply coefficient cinitial as the ratio of the shared bicycle supply amount in each unit at the initial time to the first supply coefficient cdis. In the present invention, the second supply coefficient cinitial is selected from six values, cinitial ∈ {20, 50, 100, 200, 500, 1000}. The invention defines the supply of shared bicycles in each unit at the initial time as the product of cdis and cinitial, rounded down to an integer.
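The two supply coefficients can be sketched as follows. The exact index convention for "the value at the 40% position after sorting" is ambiguous in the text, so the indexing below is one plausible reading, and the function names are illustrative.

```python
import math

def first_supply_coefficient(demand_values):
    """c_dis: the 40th percentile of per-unit, per-10-minute demand values,
    read here as the value at position int(0.4 * n) after ascending sort.
    Using a percentile rather than the mean damps the effect of extremes."""
    vals = sorted(demand_values)
    idx = min(int(0.4 * len(vals)), len(vals) - 1)
    return vals[idx]

def initial_supply(c_dis, c_initial):
    """Initial per-unit supply: the product c_dis * c_initial, rounded down."""
    return math.floor(c_dis * c_initial)
```

With the sample values below, a single extreme demand barely moves the coefficient, which is the stated motivation for using the 40th percentile.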
In the embodiment of the present invention, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class, and a scheduling policy variable class;
the policy execution state variable class comprises a policy execution state variable tr, wherein tr ∈ {0, 1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each dispatching area unit, the shared bicycle supply variable of each dispatching area unit when the policy execution state variable tr = 0 (indicating the number of shared bicycles available), and the shared bicycle supply variable of each dispatching area unit when the policy execution state variable tr = 1 (indicating the number of shared bicycles available after scheduling);
at time step t, the riding trip variable class comprises the global label η2 of the dispatching area unit where the OD origin of a shared bicycle trip is located, the global label η3 of the dispatching area unit where the OD destination of a shared bicycle trip is located, the OD label variable (η2, η3) of a shared bicycle trip, the OD flow of shared bicycle trips, the travel flow rate of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle travel amount generated by η5, and the actual shared bicycle attraction amount of η5;
η2 and η3 are converted to horizontal and vertical labels in the same way as the label relation of the dispatching area units; for each origin unit η2, the travel flow rates to all destination units η3 sum to 1; when η2 = η5, the path flows with unit η2 as the origin sum to the actual travel amount, and when η3 = η5, the path flows with unit η3 as the destination sum to the actual attraction amount;
At time step t, the scheduling strategy variable class comprises a dispatching vehicle label set I, a dispatching vehicle label variable i, a dispatching vehicle origin unit label variable, a dispatching vehicle arrival unit label variable, a set κ1 of dispatching vehicle moving direction variables, a set κ2 of dispatch ratio variables, the moving direction variable of a dispatching vehicle toward its six adjacent regular hexagons, the dispatch ratio variable of a dispatching vehicle, the scheduling strategy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable for the number of shared bicycles a dispatching vehicle picks up at its origin unit and places at its arrival unit, the ratio αwh of the number of shared bicycles placed at the arrival unit ηi,1 to the number of vehicles in the cabin when the arrival unit belongs to ηw, the predicted cumulative increase or decrease of the supply amount of η5 before a dispatching vehicle implements its scheduling strategy, under the expectation that the preceding dispatching vehicles implement theirs, the increased revenue after a dispatching vehicle implements the scheduling strategy, and the total amount Zwarehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
where I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0, 1, ..., 5}, and κ2 = {0, 0.25, 0.5, 0.75}. When the moving direction variable takes the values 0 to 5, the dispatching vehicle moves to the lower-left, right, upper-left, left, lower-right and upper-right adjacent units respectively, according to the following relation:
in the formula, m represents the horizontal direction label variable of a dispatching area unit, h represents the vertical direction label variable of a dispatching area unit, M′ represents the unit label set of the dispatching area units, and T represents the time step variable set.
The origin and arrival unit labels ηi,0 and ηi,1 are converted to horizontal and vertical labels in the same way as the label relation of the dispatching area units. The dispatch ratio variable of dispatching vehicle i may take one of four percentage values, indicating the percentage of the current unit's supply amount that dispatching vehicle i picks up from the unit. When the number of shared bicycles expected to be cumulatively removed from unit η5 is larger than the number expected to be placed there, the predicted cumulative increase/decrease takes a negative value; conversely, when the number of vehicles expected to be removed is less than or equal to the number placed, the predicted cumulative increase/decrease is non-negative.
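The six-direction movement convention can be sketched as follows. The patent's (m, h) offset relation is not reproduced in the text, so the offsets below are assumptions that merely illustrate the direction indexing; the boundary rule (a vehicle stays in place rather than leaving the region, per the assumption stated later in this section) is included for completeness.

```python
# Direction index 0-5 -> (dm, dh) offset on the hexagonal unit grid.
# The specific offsets are assumed, not taken from the patent's formula.
OFFSETS = {
    0: (-1, -1),  # lower-left  (assumed offset)
    1: (1, 0),    # right
    2: (0, 1),    # upper-left  (assumed offset)
    3: (-1, 0),   # left
    4: (0, -1),   # lower-right (assumed offset)
    5: (1, 1),    # upper-right (assumed offset)
}

def move(m, h, direction, M):
    """Move one unit in the given direction; stay put if the move would
    leave the region (labels must remain within {0, ..., M})."""
    dm, dh = OFFSETS[direction]
    nm, nh = m + dm, h + dh
    if 0 <= nm <= M and 0 <= nh <= M:
        return nm, nh
    return m, h
```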
In the embodiment of the present invention, in step S3, the shared bicycle vehicle dispatching optimization model is specifically:
s.t.
in the vehicle dispatching optimization model, the increased benefit generated after the dispatching vehicles implement the scheduling strategy is maximized as the objective function of the short-term scheduling optimization problem of shared bicycles, summed over all time steps and all dispatching vehicles, wherein t represents a time step, Tmax represents the maximum value of the time step, i represents the dispatching vehicle label variable, N represents the maximum value of the dispatching vehicle label variable, and the scheduling strategy of each dispatching vehicle is the decision variable;
the invention sets the benefits to maximize the benefits of sharing bicycles compared to the case of not performing any scheduling strategy. Decision variables are action decisions for dispatch vehiclesIncluding the direction of movement of the dispatching truckAnd scheduling ratio
When the policy execution state variable tr of time step variable t equals 0, the decision variable is the action decision of each dispatching vehicle, composed of the moving direction variable of the dispatching vehicle from ηi,0 toward its six adjacent regular hexagons and the dispatch ratio variable of the dispatching vehicle, for tr = 0, i ∈ I;
when the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a dispatching area unit is the same as the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the shared bicycle path flow of the OD label variable (η2, η3) is computed as INT of the product of the unit's actual travel amount and the travel flow rate from η2 to η3, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′, where INT(·) denotes rounding down to an integer and M′ denotes the unit label set of the dispatching area units;
For policy execution state tr = 0, the actual travel amount of a dispatching area unit is the smaller of the supply amount and the travel demand of that unit. The path flow is the rounded-down integer of the product of the actual travel amount of the referenced dispatching area unit η5 and the travel flow rate with η2 as origin and η3 as destination.
The travel flow rates of shared bicycles departing from η2 and arriving at η3 satisfy the conservation relation between path flow and OD flow: for each origin unit η2, the travel flow rates to all destination units η3 sum to 1, for tr = 0, η2 ∈ M′, η3 ∈ M′, where T represents the time step variable set and η3 represents the global label of the unit where the OD destination of a shared bicycle trip is located;
According to the path flows, when the policy execution state tr = 0 at time step t and the global label variable η5 of a dispatching area unit equals the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the sum of the shared bicycle path flows is taken as the actual shared bicycle travel amount of the dispatching area unit η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′;
When the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a dispatching area unit equals the global label η3 of the unit where the OD destination of a shared bicycle trip is located, the sum of the shared bicycle path flows over the OD label variables (η2, η3) is taken as the actual shared bicycle attraction amount of the dispatching area unit η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′;
When the policy execution state variable tr of time step variable t equals 0, the shared bicycle supply amount is updated according to the numbers of shared bicycles rented and parked during the riders' travel activities: the supply amount at time step t with tr = 0 equals the supply amount with tr = 1 after the scheduling strategy was implemented at time step (t - 1), minus the actual shared bicycle travel amount of η5 at time step t, plus the actual shared bicycle attraction amount of η5 at time step t;
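The flow and supply-update rules above can be sketched for a toy two-unit network as follows. The patent gives these relations only as formula images, so this is a consistent reading, not the reproduced model; all names are illustrative, and INT(·) is taken as rounding down.

```python
import math

def update_supply_demand(supply, demand, rates):
    """One tr = 0 update: each unit's effective travel is the smaller of its
    supply and demand; path flow o -> d is INT(effective travel * rate);
    new supply = old supply - actual travel amount + actual attraction amount."""
    units = sorted(supply)
    flow = {o: {d: math.floor(min(supply[o], demand[o]) * rates[o][d])
                for d in units} for o in units}
    travel = {o: sum(flow[o].values()) for o in units}            # trips leaving o
    attract = {d: sum(flow[o][d] for o in units) for d in units}  # trips arriving at d
    new_supply = {u: supply[u] - travel[u] + attract[u] for u in units}
    return flow, travel, attract, new_supply
```

Because every departing trip arrives somewhere in the region, the update conserves the total number of shared bicycles, matching the stated conservation constraint.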
when the policy execution state variable tr of time step variable t equals 0, the arrival unit label variable of the dispatching vehicle at time step (t + 1) is determined from the origin unit label variable and the moving direction variable of the dispatching vehicle from ηi,0 toward its six adjacent regular hexagons, for tr = 0, i ∈ I, ηi,0 ∈ M′, ηi,1 ∈ M′, where m denotes the horizontal direction label variable and h denotes the vertical direction label variable of a dispatching area unit;
when the policy execution state variable tr of time step variable t equals 0, the predicted cumulative increase/decrease of the supply amount of η5 is computed from the numbers of shared bicycles that the preceding dispatching vehicles up to (i-1) are predicted to pick up from and place at η5, where αwh denotes the ratio, applied when the arrival unit ηi,1 of a dispatching vehicle belongs to ηw, of the number of shared bicycles placed at ηi,1 to the number of vehicles in the cabin, and ηw represents the fixed warehouse location set;
When the policy execution state tr = 0 at time step t, after the front (i-1) dispatching vehicles are predicted to implement scheduling, the predicted cumulative increase/decrease of the supply amount of unit η5 is updated as follows. If the predicted scheduling strategy of the (i-1)th dispatching vehicle does not involve unit η5, i.e., η5 is neither its origin nor its arrival unit, the predicted cumulative increase/decrease is 0. If the (i-1)th dispatching vehicle is predicted to pick up a certain number of vehicles from unit η5, the predicted cumulative increase/decrease of the supply amount of unit η5 decreases by that number. If the (i-1)th dispatching vehicle is predicted to place a number of vehicles into unit η5 and the label of unit η5 does not belong to the city fixed warehouse location set ηw, the predicted cumulative increase/decrease of the supply amount of unit η5 increases by that number. If the (i-1)th dispatching vehicle is predicted to place vehicles into unit η5 and the label of unit η5 belongs to the city fixed warehouse location set ηw, the predicted cumulative increase/decrease of the supply amount of unit η5 increases by the αwh proportion of the placed vehicles, the remainder being stored in the warehouse. When the city fixed warehouse location set ηw is empty, the case of city fixed warehouses is not considered by default.
When the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up shared bicycles from ηi,0 into its cabin and later places all shared bicycles in the cabin at ηi,1. The number of vehicles picked up is calculated as the minimum of the number requested by the dispatch ratio, the remaining supply, and the maximum cabin capacity, rounded down to an integer and bounded below by zero, for tr = 0, i ∈ I, η5 ∈ M′, where min(·) denotes taking the minimum value, ηi,0 represents the origin unit label variable of the dispatching vehicle, and the dispatch ratio variable and the maximum cabin capacity are as previously defined;
The requested number is the dispatch ratio percentage, specified by the scheduling strategy, of the current unit's supply amount, where the relevant supply amount is the remaining supply in the time state tr = 0 after the preceding (i-1) dispatching vehicles are assumed to have performed scheduling at time step t. The number of shared bicycles picked up is therefore the minimum of the number requested by the scheduling strategy, the remaining supply, and the cabin capacity, taken as an integer, with the whole expression subject to a non-negativity constraint.
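The pick-up rule just described can be sketched as follows; the exact formula is an image in the source, so this is one consistent reading, with illustrative names.

```python
def pickup_count(ratio, supply_tr0, predicted_delta, capacity):
    """Number of bicycles a dispatching vehicle picks up: the dispatch-ratio
    share of the remaining supply (current tr = 0 supply plus the cumulative
    change predicted for earlier-numbered vehicles), capped by the cabin
    capacity and floored at zero.  Since ratio <= 0.75, the result never
    exceeds the remaining supply itself."""
    remaining = supply_tr0 + predicted_delta
    n = min(int(ratio * remaining), capacity)
    return max(n, 0)
```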
When the policy execution state variable tr = 1 at time step t, the scheduling strategy is executed according to the number of vehicles picked up by each dispatching vehicle, and the supply of η5 is updated to obtain the shared bicycle supply variable of η5 after the scheduling strategy is implemented;
When the time state tr = 1 at time step t, the supply amount of unit η5 after dispatching vehicle i performs scheduling is updated as follows. If the scheduling strategy of dispatching vehicle i does not involve unit η5, i.e., η5 is neither its origin nor its arrival unit, the supply amount is unchanged. If dispatching vehicle i picks up a number of vehicles from unit η5, the supply amount of unit η5 decreases by that number. If dispatching vehicle i places a number of vehicles into unit η5 and the label of unit η5 does not belong to the city fixed warehouse location set ηw, the supply amount of unit η5 increases by that number. Otherwise, when the label of unit η5 belongs to the city fixed warehouse location set ηw, the supply amount of unit η5 increases by the αwh proportion of the placed vehicles, and the remaining shared bicycles are by default placed into the city fixed warehouse of unit η5.
The total number Zwarehouse of shared bicycles stored in the city fixed warehouses is calculated by accumulating, over the scheduling cycle, the numbers of shared bicycles placed into the city fixed warehouse units by the dispatching vehicles.
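One scheduling action's effect on unit supplies and the warehouse total can be sketched as follows. The αwh convention (the fraction of the cabin load that re-enters the street supply when the arrival unit hosts a warehouse) is a reading of the text rather than a reproduced formula, and all names are illustrative.

```python
import math

def apply_dispatch(supply, z_warehouse, origin, dest, n, warehouse_units, alpha_wh):
    """Apply one scheduling action: n bicycles leave the origin unit and are
    placed at the arrival unit.  If the arrival unit hosts a city fixed
    warehouse, the alpha_wh proportion re-enters the street supply and the
    remainder is stored in the warehouse (accumulated in z_warehouse)."""
    supply = dict(supply)  # leave the caller's state untouched
    supply[origin] -= n
    if dest in warehouse_units:
        to_street = math.floor(alpha_wh * n)
        supply[dest] += to_street
        z_warehouse += n - to_street
    else:
        supply[dest] += n
    return supply, z_warehouse
```

Summing the warehouse increments of every action over the cycle yields Zwarehouse; outside warehouse units the street total is conserved.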
The shared bicycle scheduling optimization problem makes two assumptions. Assumption one: the invention assumes that the dispatching vehicles implement their scheduling strategies sequentially according to their labels. Riders rent shared bicycles according to the current supply in their current unit, the decision maker formulates a scheduling strategy based on the supply and demand environment after the trips are finished, and then the scheduling strategy is implemented and the supply and demand environment is updated. The policy execution state variable tr divides each time step into two time states, namely a state for updating rider trips and formulating the scheduling strategy, and a state for implementing the scheduling strategy. When tr = 0, riders rent, use and park shared bicycles, the supply and demand changes of each unit at time step t are updated, and a scheduling strategy is generated based on the post-trip supply and demand conditions; when tr = 1, the scheduling strategy is implemented and the supply and demand environment is updated under its influence. Assumption two: to ensure that a dispatching vehicle does not travel outside the region, the invention assumes that the dispatching vehicle stays at its current location in that case. That is, when a dispatching vehicle would, according to its scheduling strategy, move beyond the region at time step (t + 1), the scheduling strategy is updated so that the unit the dispatching vehicle reaches at time step (t + 1) is the unit it occupies at time step t. Under this assumption, the origin and arrival unit label variables of each dispatching vehicle must both belong to the unit label set M′.
In the shared bicycle scheduling optimization problem, the length of the scheduling cycle can be controlled through the setting of the time step set. For the short-term scheduling optimization problem of shared bicycles, the scheduling cycle of the invention is set to one day, i.e., Tmax = 143 and the time set T = {0, 1, ..., 143}. In conventional scheduling methods, the scheduling cycle is usually one day. However, in practice, because the number of dispatching vehicles is limited, the problems of uneven distribution and lost demand of shared bicycles become more serious over time. Especially in the later period of operation, the distribution of shared bicycles becomes more unbalanced, so formulating an effective strategy becomes more challenging. Aiming at long-term operation scheduling, the invention also considers the long-cycle dynamic scheduling optimization problem of shared bicycles. The scheduling cycle of the long-term scheduling optimization problem is defined as 7 days, with Tmax = 1007 and the time set T = {0, 1, ..., 1007}.
The present invention further aims to reduce the number of shared bicycles left excessively idle on urban roads, while still increasing the number of shared bicycle trips as much as possible. Therefore, the invention proposes a dynamic scheduling optimization problem of shared bicycles that includes city warehouses. In this problem, the invention assumes that fixed warehouses exist in the city and that redundant bicycles can be stored there. During a dispatching operation, a dispatching vehicle may move to a unit in which a warehouse is built and place the bicycles stored in its cabin into the city warehouse. When the city fixed warehouse location set ηw is empty, the shared bicycle scheduling optimization problem does not consider city fixed warehouses, i.e., redundant shared bicycles cannot be deposited into a city fixed warehouse during scheduling. Conversely, when the city fixed warehouse location set ηw is not empty, the shared bicycle scheduling optimization problem assumes by default that city fixed warehouses exist for storing excess idle shared bicycles.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0, 1}, the dispatching vehicle label set I = {0, 1, ..., N}, the set of dispatching vehicle moving direction variables κ1 = {0, 1, ..., 5}, the set of dispatch ratio variables κ2 = {0, 0.25, 0.5, 0.75}, and the unit label set M′ = {0, 1, ..., (M+1)^2 - 1}. According to the variable definitions of the short-term scheduling optimization problem of shared bicycles, under the condition that the scheduling strategy is implemented, the constraints on the actual travel amount and the actual attraction amount of each unit ensure the conservation of shared bicycle travel flow for each unit.
In the constructed short-term dispatching optimization problem for shared bicycles, the objective function is to maximize the increase in the total trip amount of shared bicycles in the areas through which the dispatching vehicles pass, compared with the case where no scheduling strategy is implemented. The decision variables are the action decisions of the dispatching vehicles, including the moving direction of the dispatching vehicle toward a unit and the number of bicycles to be dispatched. The constraints are conservation of the total number of shared bicycles, the conservation relation between riding trip path flows and riding OD flows, and non-negativity and integer constraints on the flows during scheduling. When the travel demand generated in a unit is greater than the shared bicycles available in the unit, the excess demand is treated as lost demand.
In the embodiment of the present invention, step S4 includes the following sub-steps:
S41: determining the elements of the shared bicycle dispatching framework based on the vehicle dispatching optimization model of the shared bicycles;
S42: determining the average action by means of one-hot coding;
S43: defining the experience pool variables and training round related variables of the shared bicycle dispatching framework;
S44: constructing the shared bicycle dispatching framework from the elements, the average action, the experience pool variables and the training round related variables based on mean field theory.
According to the proposed shared bicycle scheduling optimization problem, the invention provides a multi-agent reinforcement learning shared bicycle dispatching framework based on mean field theory, with the aim of enabling the agents to learn the changing riding demand, adapt to a stochastic dynamic environment, realize cooperative dynamic decision optimization, and increase the riding trip amount.
In the embodiment of the invention, in step S41, the invention combines the shared bicycle transfer process model with a multi-agent reinforcement learning algorithm to construct the shared bicycle vehicle dispatching model. The invention defines I as the agent label set, equal to the label set of the dispatching vehicles, S as the state set, Ai as the action space of agent i, P as the transition probability function, R as the reward function, and γ as the discount factor. The MDP-based reinforcement learning model then contains six elements: G = (I, S, A, P, R, γ), where i represents the dispatching vehicle label variable, equivalent to the label variable of an agent in the reinforcement learning algorithm, and i ∈ I = {0, 1, ..., N}.
The elements of the shared bicycle dispatching framework include the state of each dispatching vehicle i (i = 0, ..., N) at time step variable t, the joint action at, and the reward function, where the action component of dispatching vehicle i at time step variable t is its scheduling strategy;
the reward function comprises a reward function based on the actually increased trip amount of the dispatching vehicle, a reward function based on the average increased trip amount of the dispatching vehicle, and a reward function based on the globally increased trip amount of all dispatching vehicles, with the specific formulas as follows:
wherein αrw represents the scaling coefficient of the reward function; the per-agent reward terms use the actual travel amounts of the units ηi,0 and ηi,1 when the scheduling strategy is implemented and when no scheduling strategy is implemented; the global reward term uses the actual travel amount of each η5 when the scheduling strategy is implemented and when it is not implemented; N represents the maximum value of the dispatching vehicle label variable, η5 represents the global label of each dispatching area unit, and M′ represents the unit label set of the dispatching area units.
For the state of dispatching vehicle i (i = 0, ..., N) at time step variable t, the present invention assumes that the state at time t contains the supply amount of the unit where agent i is located and the position label of that unit.
The behavior parameter refers to the joint action of the scheduling strategies of the dispatching vehicles at time t and satisfies at ∈ A = A0 × A1 × ... × AN, where A is the set of vectors formed from the action spaces Ai. The action strategy of agent i is equal to the scheduling strategy of the corresponding dispatching vehicle.
Agent i corresponds to a dispatching vehicle label in the city. The reward function is an instant evaluation, given by the environment during the interaction of agent i with the environment, of the state and the generated action. The goal of agent i is to find the maximum reward value. Based on the shared bicycle scheduling problem, the invention defines the variables used in the calculation of the reward function as follows:
αrw - reward function scaling coefficient, dimensionless;
actual travel amount of unit ηi,0 at time step t when no scheduling strategy is implemented, dimensionless;
actual travel amount of unit ηi,1 at time step t when no scheduling strategy is implemented, dimensionless;
In the shared bicycle scheduling problem, the invention considers reward functions of three forms and takes them as selectable reward functions in the dispatching framework.
(1) Increased trip amount reward function obtained by the agent: the invention defines an increased trip amount (PA) reward function obtained by an agent, referred to as the PA reward function. It represents the increased amount of shared bicycle trips that each agent obtains after performing an action. In the PA reward function, the reward in the units that the agent has moved through is regarded as the reward earned by that agent. This setting of the PA reward function may cause the agent to focus on scheduling only certain units.
(2) Average increased trip amount reward function obtained by the agent: the invention defines an average increased trip amount (APA) reward function obtained by an agent, referred to as the APA reward function. The APA reward function refers to the average increased number of shared bicycle trips that an agent obtains after performing an action, defined as the mean increased trip amount of the units ηi,0 and ηi,1 through which the dispatching vehicle passes when executing the scheduling strategy.
(3) Globally increased trip amount obtained by the agents: the invention defines a globally increased trip amount of total units (APTU) reward function, referred to as the APTU reward function. The APTU reward function refers to the total region-wide increase in shared bicycle trips that all agents obtain after performing the joint action.
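A minimal sketch of the three reward forms follows. Since the formulas appear only as images in the source, the αrw scaling and the exact denominators (APA averaging over the units passed, APTU sharing the region-wide increase among the vehicles) are assumptions; all names are illustrative.

```python
def pa_reward(alpha_rw, trips_with, trips_without, units_passed):
    """PA: increased trip amount in the units the agent moved through."""
    return alpha_rw * sum(trips_with[u] - trips_without[u] for u in units_passed)

def apa_reward(alpha_rw, trips_with, trips_without, units_passed):
    """APA: the same increase, averaged over the origin/arrival units passed."""
    return pa_reward(alpha_rw, trips_with, trips_without, units_passed) / len(units_passed)

def aptu_reward(alpha_rw, trips_with, trips_without, n_vehicles):
    """APTU: region-wide trip increase from the joint action, here shared
    equally among the dispatching vehicles (an assumed normalization)."""
    total = sum(trips_with[u] - trips_without[u] for u in trips_with)
    return alpha_rw * total / n_vehicles
```

PA credits only the units an agent visited, which can make it fixate on a few units; APTU ties every agent's reward to the global outcome, which matches the cooperative objective.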
The state transition probability refers to the updating of each agent's state as the time steps advance, based on the joint actions performed by the agents and the environment interaction.
In the embodiment of the present invention, in step S42, the specific method of determining the average action is as follows: the scheduling strategy of each dispatching vehicle is rewritten in one-hot coding form to obtain the average action, with the calculation formulas respectively as follows:
wherein each component of the one-hot vector is a variable taking the value 0 or 1, ρdim represents the dimension of the scheduling strategy, N represents the maximum value of the dispatching vehicle label variable, and ine represents the label variables of the dispatching vehicles other than dispatching vehicle i, whose action strategies enter the average.
According to the one-hot coding scheme, the present invention rewrites the action vector containing the moving direction and the dispatch ratio, and the average action of agent i is obtained by averaging the actions of the remaining agents.
In a traditional multi-agent deep reinforcement learning algorithm, the joint action at of all agents is used directly, and at has dimension (N + 1)ρdim. This dimension expands as the number of agents increases, which leads to problems such as increased network complexity, reduced computational efficiency, and degraded strategy optimization in reinforcement learning.
However, in the MFMARL algorithm, the dimension of an agent's own action is ρ_dim, the dimension of the joint mean action is also ρ_dim, and the combined input of own action and mean action has dimension 2ρ_dim. Therefore, processing the joint action in the mean-field (MF) manner keeps its dimension under control and preserves calculation efficiency. In particular, in a complex simulation context, where the number of agents is typically large, employing the joint mean action based on MF theory alleviates the dimension explosion of the joint action caused by an increase in the number of agents.
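The dimension argument above can be made concrete with a minimal sketch (assumed function names; the one-hot rewriting and averaging follow the description, not the patent's omitted formulas):

```python
import numpy as np

def one_hot(index, dim):
    """One-hot encode a discrete action index into a rho_dim vector."""
    v = np.zeros(dim)
    v[index] = 1.0
    return v

def mean_action(neighbor_action_indices, rho_dim):
    """Mean-field average of the other agents' one-hot actions.
    Its dimension stays rho_dim no matter how many agents there are,
    so the value-network input is 2 * rho_dim (own action + mean action)
    instead of (N + 1) * rho_dim for the full joint action."""
    onehots = np.stack([one_hot(a, rho_dim) for a in neighbor_action_indices])
    return onehots.mean(axis=0)
```

For example, with 4 neighboring dispatch vehicles and ρ_dim = 4, the mean action is still a length-4 probability-like vector, whereas the raw joint action would be length 20.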
In an embodiment of the present invention, the experience pool variables of the shared bicycle dispatching framework in step S43 include the experience pool and the experience pool capacity.
The training-round related variables comprise the number of training rounds Episode, the number of training rounds between target-network updates Episode_upnet, the target-network update weight coefficient ω, and the cumulative return discount factor γ.
In the embodiment of the present invention, step S44 includes the following sub-steps:
S441: initializing the experience pool, and setting the experience pool capacity, the target-network update weight coefficient ω, the reward function scaling coefficient α_rw, the cumulative return discount factor γ, the initial given supply amount, the shared bicycle travel demand variable, and the travel flow of shared bicycles departing from η_2 and arriving at η_3; then cyclically performing steps S442-S445 according to the number of training rounds;
S442: updating the shared bicycle running environment when the strategy execution state variable tr equals 0;
S444: updating the shared bicycle running environment when the strategy execution state variable tr equals 1;
S445: updating, for each dispatch vehicle, the state and the average action of the next time step, and updating, according to the reward function, the increased revenue of the dispatch vehicle after implementing its scheduling strategy;
S446: based on the updating process of steps S442-S445, constructing a shared bicycle dispatching framework using a reinforcement learning algorithm, and completing shared bicycle dispatching using the shared bicycle dispatching framework.
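The alternating tr=0 / tr=1 update cycle of steps S442-S445 can be sketched as a skeleton loop. This is a hypothetical structure under assumed callable names (`reset_env`, `update_env`, `select_actions`, `observe`), not the patent's implementation:

```python
def run_episode(reset_env, update_env, select_actions, observe, buffer, t_max):
    """Skeleton of the update loop S442-S445: alternate the tr=0 (riding)
    and tr=1 (dispatching) environment updates, then store the transition
    in the experience pool for later sampling (step S446)."""
    states = reset_env()
    for _ in range(t_max):
        update_env(states, tr=0)          # S442: riding trips update supply
        actions = select_actions(states)  # dispatch vehicles pick scheduling strategies
        update_env(states, tr=1)          # S444: dispatch moves update supply
        next_states, rewards = observe(states, actions)  # S445: next states, rewards
        buffer.append((states, actions, rewards, next_states))
        states = next_states
    return buffer
```

The returned buffer plays the role of the experience pool from which mini-batches are later drawn to train the networks.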
Stable-environment improvement for multiple agents: during training in a multi-agent environment, the strategy of agent i is constantly changing. For agent i, the constantly changing policies of the other agents place it in a non-stationary environment, in which the transition probability from the state at one time to the state at the next cannot be guaranteed to be stable. That is, for any two times t_1 ≠ t_2 there may exist P(s'|s, a, π_{t_1}) ≠ P(s'|s, a, π_{t_2}), where π_{t_1} and π_{t_2} denote the policies of agent i at times t_1 and t_2, and P(s'|s, a, π_{t_1}) and P(s'|s, a, π_{t_2}) denote the corresponding state transition probabilities of agent i.
If agent i learns the action contents of all agents during reinforcement learning, its environment can be made stationary. Since a policy can be expressed through actions and states, when the states and actions of all agents are known, the state transition probabilities of agent i at times t_1 and t_2 satisfy, respectively, P(s'|s, a_1, ..., a_N, π_{t_1}) = P(s'|s, a_1, ..., a_N) and P(s'|s, a_1, ..., a_N, π_{t_2}) = P(s'|s, a_1, ..., a_N).
Therefore, P(s'|s, a_1, ..., a_N, π_{t_1}) and P(s'|s, a_1, ..., a_N, π_{t_2}) may be considered policy independent. Even while the agents' strategies are changing, the transition probability from the state at one time to the state at the next retains stationarity, i.e. P(s'|s, a_1, ..., a_N, π_{t_1}) = P(s'|s, a_1, ..., a_N, π_{t_2}).
Therefore, within a model in which the joint action is known, for arbitrary times t_1 ≠ t_2, the environment of agent i can be improved to a stationary environment: P(s'|s, a_1, ..., a_N, π_{t_1}) = P(s'|s, a_1, ..., a_N) = P(s'|s, a_1, ..., a_N, π_{t_2}).
In this embodiment of the present invention, in step S442, the specific method for updating the shared bicycle running environment when the strategy execution state variable tr equals 0 is: updating and calculating the OD label variable (η_2, η_3) of shared bicycle trips, the shared bicycle path flow, the actual travel amount of shared bicycles in unit η_5, the actual attraction amount of shared bicycles in unit η_5, and the supply amount when the strategy execution state variable tr equals 0.
In step S444, the specific method for updating the shared bicycle running environment when the strategy execution state variable tr equals 1 is: updating and calculating the behavior parameter a_t, the variable for the number of shared bicycles that the dispatch vehicle picks up at η_{i,0} and places upon reaching η_{i,1}, and the supply amount when the strategy execution state variable tr equals 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
Each dispatch vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatch vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state of the dispatch vehicle and whose output is its scheduling strategy; the policy target network is a neural network whose input is the state of the dispatch vehicle at the next time step and whose output is the scheduling strategy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatch vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatch vehicle, its scheduling strategy, and the average action, and whose output is the Q-value function Q_i, where the Q-value function refers to the state-action value function of the reinforcement learning algorithm and represents the cumulative reward value obtained by the dispatch vehicle. The value target network is a neural network whose inputs are the state, scheduling strategy, and average action of the dispatch vehicle at the next time step, and whose output is the target Q-value function. For both the estimation network and the target network of the value model, Q_i and the target Q value are computed by forward propagation, and the parameters are updated by backward propagation, similarly to the policy model.
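The value-estimation network described above can be sketched as a minimal two-layer forward pass. The layer sizes and parameter names (`W1`, `b1`, `W2`, `b2`) are illustrative assumptions; only the input layout (state, one-hot scheduling action, mean action) follows the description:

```python
import numpy as np

def q_forward(params, state, action_onehot, mean_action):
    """Hypothetical two-layer value-estimation network: the input is the
    concatenation of the dispatch vehicle's state, its one-hot scheduling
    action, and the mean action; the output is a scalar Q value."""
    x = np.concatenate([state, action_onehot, mean_action])
    h = np.tanh(params["W1"] @ x + params["b1"])   # hidden layer
    out = params["W2"] @ h + params["b2"]          # scalar head
    return float(out[0])
```

Note the input dimension is `state_dim + 2 * rho_dim`, matching the earlier point that the mean-field input stays fixed as the number of agents grows.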
In the Q-Learning method, the policy model of the dispatch vehicle obtains its action by probability sampling: given the average action of the other dispatch vehicles (labelled i_ne) at time step t-1, an action probability is computed from the Q-value function Q_i, parameterized by ω_d, over the action space set A_i; an action is sampled according to this probability; the average action is then updated and substituted back into the formula; an action is sampled again according to the recomputed probability; and this finally sampled action is taken as the action selected by the policy model of the dispatch vehicle;
For the Q-Learning based reinforcement learning algorithm, the policy model of each agent obtains its action directly from the Q value Q_i of agent i and does not include a policy target model. The value model of each agent is divided into a value estimation network and a value target network that is structurally consistent with the value estimation network. The estimation network of the value model inputs the global state s_t, the action of the last time step, and the average action of the last time step, and outputs the Q_i value. The input layer of the value target network takes the global state s_{t+1} of the next time step, the action value, and the average action value, and outputs the target Q value. For both networks of the value model, Q_i and the target Q value are computed in a forward-propagation mode, and parameters are updated in a backward-propagation mode.
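The probability sampling over Q values described above is commonly realized as Boltzmann (softmax) sampling; the sketch below assumes that form, with an inverse-temperature `beta` standing in for the patent's unspecified sampling parameters:

```python
import numpy as np

def boltzmann_action(q_values, beta, rng):
    """Boltzmann (softmax) sampling over Q(s, a, mean_action) values:
    actions with higher Q are sampled with higher probability, while
    beta controls how greedy the sampling is."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs)), probs
```

In the MF-Q style loop, this sampling and the mean-action update would alternate until the finally sampled action is executed.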
In the framework, each agent independently contains a policy model for decentralized action execution and a value model containing centralized state-action information. The reinforcement learning algorithms used can be divided into two types: policy gradient methods and Q-Learning methods.
If the reinforcement learning algorithm adopts the policy gradient method, the transition consisting of the global state s_t, the actions, the average action, the reward values of the dispatch vehicles, and the global state s_{t+1} of the next time step is stored in the experience pool; a batch of samples is then randomly drawn from the experience pool, the neural network parameters of the value estimation network are updated according to the sampled transitions, the cumulative return discount factor γ, and the loss function, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, according to the target-network update weight coefficient ω, the parameters θ_i of the policy model network and the parameters of the value model network are transferred to the corresponding policy target network and value target network, respectively, where s_{t,j}, s_{t+1,j} and the sampled strategies, average actions, and reward values denote the corresponding quantities of the j-th sampled transition;
If the reinforcement learning algorithm adopts the Q-Learning method, the transition is stored in the experience pool, a batch of samples is randomly drawn from the experience pool, and the neural network parameters of the value estimation network are updated on the sampled batch according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, the neural network parameters of the value model are transferred to the value target network according to the target-network update weight coefficient ω.
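The replay sampling and periodic target-network transfer shared by both training variants can be sketched as follows; the soft-update form with weight ω is an assumed interpretation of "transferring parameters according to the update weight coefficient ω":

```python
import numpy as np

def sample_batch(pool, batch_size, rng):
    """Randomly draw a mini-batch of transitions from the experience pool."""
    idx = rng.choice(len(pool), size=batch_size, replace=False)
    return [pool[j] for j in idx]

def soft_update(target, online, omega):
    """Every Episode_upnet rounds, move the target-network parameters
    toward the online-network parameters (assumed soft-update form:
    theta_target <- omega * theta_online + (1 - omega) * theta_target)."""
    for k in target:
        target[k] = omega * online[k] + (1.0 - omega) * target[k]
    return target
```

Keeping the target network only slowly tracking the online network stabilizes the bootstrapped Q targets, which is the usual motivation for the ω coefficient.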
In the embodiment of the invention, according to the proposed shared bicycle scheduling optimization problem, a shared bicycle scheduling framework based on multi-agent reinforcement learning with mean field theory is proposed, with the aim of enabling the agents to learn the changing riding demand, adapt to a dynamic environment with randomness, and realize cooperative dynamic decision optimization that increases the riding trip amount.
The basic idea of framework construction is as follows:
(1) Feasibility of solving the shared bicycle scheduling problem with a reinforcement learning algorithm framework
In the constructed shared bicycle scheduling optimization problem, the state of the supply amount at the current time is known, and historical information from past times is not required. That is, the supply-amount state at the current time is related only to the supply-amount state at the previous time and the policy action executed, and is independent of the supply states and decision actions at other times. Therefore, the supply-amount state of the shared bicycle scheduling optimization problem can be considered to have the Markov property: the current state contains all the information needed for the decision.
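The Markov property invoked above can be stated compactly in standard notation (generic symbols, not the patent's own variables):

```latex
% The supply-amount state depends only on the previous state and action:
P\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0\right)
  \;=\; P\left(s_{t+1} \mid s_t, a_t\right)
```

This is exactly the condition that lets the scheduling problem be cast as a Markov decision process in the next paragraph.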
Thus, the shared bicycle scheduling problem can be translated into a Markov decision process, which can be solved through a reinforcement learning framework; using a reinforcement learning framework to solve the scheduling optimization problem is therefore feasible. Reinforcement learning needs no labeled data and realizes model-free self-learning of the high-dimensional mapping from states to actions. In view of these advantages, the invention provides a shared bicycle dispatching optimization framework based on a reinforcement learning algorithm.
(2) Problems in solving the shared bicycle scheduling problem with a reinforcement learning algorithm framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, problems such as reduced learning effect and repeated scheduling of vehicles within the same area arise, as described below.
First, when a multi-agent setting is imposed on a conventional reinforcement learning algorithm, each agent's strategy keeps changing, producing an unstable environment for every other agent. The unstable environment violates the stationarity of Markov state transitions, causing policy estimation errors and reducing the efficiency of, or entirely preventing, policy optimization.
In the DQN algorithm, agent i learns and selects its best strategy through independent Q-learning. In a multi-agent environment, however, agent i updates its policy independently during learning and does not consider the policies of the other agents; the environment of any agent i is therefore non-stationary, and at different times t the state transition probability is not necessarily stable. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to possess stationarity, and the unstable environment contradicts this assumption. Secondly, agent i in the conventional DQN algorithm randomly samples data from the experience pool and uses the drawn samples as a training batch for the neural network; this avoids the problem of state correlation in the raw data. In a multi-agent environment, however, a strategy that agent i optimizes in the current state may be invalid in the next state of the unstable environment. With invalid strategy samples present, the learning process of the single-agent DQN algorithm becomes inefficient, and the likelihood that experience replay fails increases. Therefore, in a multi-agent environment, the conventional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a policy gradient algorithm operating in an unstable environment, as the number of agents increases, the variance of the gradient estimate also increases, and the probability that the strategy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected both by the environment and by other agents. During learning, the dimensionality of the joint action in the state-action value function expands exponentially with the number of agents. Each agent estimates its own value function from the joint strategy, and when the joint action space is large, both learning efficiency and learning effect decline.
Thirdly, when the estimation network of the reinforcement learning algorithm is not sufficiently sensitive to the input variable data, an agent influenced by a very high reward or penalty value learns parameter weights dominated by that state and action. This further desensitizes the agent to changes in state and action information, so that it repeatedly selects the same action. For a single agent, such overly monotonous action selection increases the monotonicity of the learning sample data and impairs the fitting accuracy of the neural network in the estimation network. Inaccurate estimation of the dispatch vehicle strategy by the reinforcement learning algorithm then reduces dispatching efficiency.
Fourthly, in a multi-agent reinforcement learning algorithm, the joint action may cause several dispatch vehicles to move shared bicycles within the same unit. Such non-cooperative scheduling strategies result in scheduling inefficiency, and over-scheduling may even leave some units empty of borrowable bicycles or with an excessive accumulation of shared bicycles.
(3) Basic idea of framework construction
In view of the above problems, the shared bicycle dispatching optimization framework based on multi-agent deep reinforcement learning must address, in a stabilized environment, the dimension explosion caused by the number of agents, and must solve the problem of efficient cooperation among the multiple agents. The main points of the framework construction are as follows:
First, the agent learning structure of the framework:
In distributed-structure multi-agent deep reinforcement learning, methods can be divided into group reinforcement learning and independent reinforcement learning according to whether each agent considers the state and behavior information of the other agents.
If each agent in a multi-agent system can be regarded as an independent single agent without communication capability (i.e. the agent does not consider the strategy selection of other agents during its own strategy selection), the algorithm is called an independent learning algorithm. In this case, agents can obtain shared information only through feedback from the external environment. In contrast, in a group learning algorithm, the agents are considered a combined group, and each agent also considers the strategy selection of the other agents in the learning process.
The independent learning method avoids the dimension explosion in communication caused by an increasing number of agents and can borrow reinforcement learning algorithms designed for stationary environments, but it suffers from slow convergence and long learning times. The group learning method allows the agents to communicate fully and cooperate sufficiently, but its search space is large and its learning time long. To realize cooperation and communication between the agents, the strategies of other agents are considered in learning, and a group reinforcement learning method is adopted to construct the multi-agent deep reinforcement learning algorithm.
Second, improvement of the unstable environment in the framework:
For the problem of the unstable multi-agent environment: if agent i learns the action contents of all agents during learning, its environment can be made stationary. The states and actions of all agents are therefore set herein as known information to stabilize the environment.
Third, improvement of the dimension explosion problem in the framework caused by the increased number of agents:
Mean field game theory (MFT) studies differential games among groups of rational players. While an agent considers its own state, it also accounts for the states of the remaining agents. A classic illustration of the mean field game is the coordinated movement of fish schools: each fish does not attend to the swimming behavior of every individual in the group, but adjusts its own behavior according to the behavior of the fish in its neighborhood. Mean field game theory describes the behavioral response of surrounding agents and the aggregate behavior of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The Mean Field Multi-Agent Reinforcement Learning (MFMARL) algorithm, built on mean field game theory, assumes that the influence of all other agents on a given agent can be represented by a mean distribution. MFMARL suits reinforcement learning problems with large-scale agents, simplifies the interaction calculations among agents, and resolves the expansion of the value-function space caused by growth in the number of agents. Thus, MFMARL is introduced herein into the shared bicycle dispatching framework, and each agent is defined to have the same discrete action space.
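The core approximation that MFMARL rests on can be summarized in the standard form from the mean-field reinforcement learning literature (not the patent's exact formula): the pairwise interactions of agent j with its neighbors are replaced by a single interaction with their mean action:

```latex
Q^{j}(s, a) \;=\; \frac{1}{N^{j}} \sum_{k \in \mathcal{N}(j)} Q^{j}\!\left(s, a^{j}, a^{k}\right)
\;\approx\; Q^{j}\!\left(s, a^{j}, \bar{a}^{j}\right),
\qquad
\bar{a}^{j} \;=\; \frac{1}{N^{j}} \sum_{k \in \mathcal{N}(j)} a^{k}
```

Here \(\mathcal{N}(j)\) is the neighbor set of agent j and \(N^{j}\) its size; the joint-action argument collapses to the pair \((a^{j}, \bar{a}^{j})\), which is why the value-network input dimension stops growing with the number of agents.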
Fourth, improvement of the reinforcement learning algorithm's sensitivity to changes in state and action information in the framework:
To improve learning stability in the multi-agent deep reinforcement learning framework and increase the sensitivity of the neural network to changes in state and action information, the framework adopts one-hot coding for the neural network input, and the reward function value is scaled and then processed with the hyperbolic tangent tanh(·) function.
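The scale-then-squash reward processing just described can be sketched in one line; the composition order (scale by α_rw, then tanh) follows the text, while everything else is an assumed minimal form:

```python
import math

def scaled_reward(raw_reward, alpha_rw):
    """Scale the raw reward by the coefficient alpha_rw, then squash it
    with tanh so that very large rewards or penalties cannot dominate
    the value network and desensitize the agent."""
    return math.tanh(alpha_rw * raw_reward)
```

Because tanh saturates at plus or minus 1, an extreme reward spike is bounded while the ordering between moderate rewards is preserved.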
Fifth, the framework improves the efficient collaborative ability between agents:
To improve the efficient cooperative ability among the agents, the framework is designed so that agent i learns the action contents of all agents during learning, namely the states and strategies of the other agents. In addition, different forms of reward function are discussed herein, and their impact on cooperative ability is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a shared bicycle scheduling framework of multi-agent reinforcement learning based on mean field theory, learned by the multi-agent group.
The working principle and process of the invention are as follows: the invention establishes a general framework for multi-agent deep reinforcement learning of shared bicycle scheduling based on mean field theory, to solve the shared bicycle scheduling problem over long-term scheduling processes, dynamic environments, and large-scale networks. The method considers the stationarity of state transitions in the multi-agent deep reinforcement learning algorithm, dimension explosion, communication efficiency among agents, and the agents' exploration behavior. Within the reinforcement learning framework, a coordinated and effective dynamic strategy is obtained in a high-dimensional action space, so that travel demand is satisfied and idle shared bicycles on the roads are reduced. Combining basic reinforcement learning theory with research on shared bicycle dispatching systems, the division of area units is defined and a shared bicycle dispatching optimization model is constructed.
Aiming at the high-dimensional multi-agent action space, a shared bicycle scheduling framework for multi-agent deep reinforcement learning based on mean field theory is provided. The proposed framework can address long-term scheduling, dynamic environments, and large-scale complex networks. The framework needs no advance demand prediction or data processing, and is unaffected by the computational efficiency and accuracy of demand prediction. Moreover, the framework does not merely seek the optimal strategy for each time segment but optimizes the whole scheduling process overall, considering the supply and demand changes of future time segments and the influence of each scheduling decision on the supply and demand of the next time segment.
The invention has the beneficial effects that:
(1) The shared bicycle dispatching optimization method based on reinforcement learning provided by the invention helps intelligently solve the short-term and long-term dispatching optimization problems of shared bicycles over large-scale road networks in random and complex dynamic environments. The method needs no advance demand prediction or manual data processing, and is unaffected by the computational efficiency and accuracy of demand prediction. It is not an optimal strategy for each individual time segment but an overall optimization of the whole scheduling process, considering supply and demand changes in future time segments and the influence of scheduling decisions on the supply and demand of the next time segment.
(2) The dynamic optimization scheduling strategy provided by the invention improves scheduling operation efficiency. It raises the trip-carrying capacity and utilization rate of the bicycle sharing system and reduces the loss of user demand. It lowers the idle rate of shared bicycles on roads and reduces the excessive accumulation of idle vehicles in certain areas. It cuts the waste of shared resources and alleviates the degradation of the urban environment caused by large piles of idle vehicles.
(3) Increasing the actual trip volume of bicycle-sharing users raises the share of cycling in connecting trips and improves the operating efficiency of the public transport system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the scope of the invention is not limited to the specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.
Claims (10)
1. A shared bicycle scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
s1: dividing a dispatching area of the shared bicycles to obtain dispatching area units, and determining running environment variables of the shared bicycles;
s2: determining a dispatching variable of the shared bicycle according to the running environment variable of the shared bicycle based on the dispatching area unit;
s3: constructing a bicycle dispatching optimization model of the shared bicycles according to the dispatching variables of the shared bicycles;
S4: constructing a shared bicycle dispatching framework using mean field theory, based on the vehicle dispatching optimization model of the shared bicycles, and completing shared bicycle dispatching using the shared bicycle dispatching framework.
2. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S1, the specific method for dividing the dispatching area of the shared bicycles is as follows: dividing the dispatching area of the shared bicycles into a plurality of identical equilateral hexagons as dispatching area units, and defining for each dispatching area unit a global label variable η_5, a horizontal direction label variable m, and a vertical direction label variable h, which satisfy the following relation:
wherein η_5 ∈ M′, M′ = {0, 1, ..., (M+1)^2 − 1}, M denotes the maximum value of the horizontal or vertical direction label variable of a dispatching area unit, and M′ denotes the unit label set of the dispatching area units;
in step S1, the operation environment variables of the shared bicycles include a time variable and a city fixed warehouse location set variable;
the time variables comprise a time step variable T, a time step variable set T and a maximum value variable T of a time stepmaxWherein T ∈ T, T ═ {0,1max};
The urban fixed warehouse location set variable comprises the fixed warehouse location set η_w.
3. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S2, the dispatching variables of the shared bicycles comprise a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a dispatching policy variable class;
the policy execution state variable class comprises a policy execution state variable tr, wherein tr belongs to {0,1 };
At time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of a dispatching area unit, the shared bicycle supply variable of the dispatching area unit when the policy execution state variable tr equals 0, and the shared bicycle supply variable of the dispatching area unit when the policy execution state variable tr equals 1;
At time step t, the riding trip variable class comprises the global label η_2 of the dispatching area unit containing the OD origin of a shared bicycle trip, the global label η_3 of the dispatching area unit containing the OD destination of the trip, the OD label variable (η_2, η_3) of shared bicycle trips, the OD flow of shared bicycle trips, the travel flow of shared bicycles departing from η_2 and arriving at η_3, the actual travel amount variable of shared bicycles in unit η_5, and the actual attraction amount variable of shared bicycles in unit η_5;
At time step t, the scheduling strategy variable class comprises the dispatch vehicle label set I, the dispatch vehicle label variable i, the dispatch vehicle starting unit label variable, the dispatch vehicle arrival unit label variable, the set κ_1 of dispatch vehicle moving direction variables, the set κ_2 of dispatch ratio variables, the moving direction variable of a dispatch vehicle from its unit toward the six adjacent regular hexagons, the dispatch ratio variable of a dispatch vehicle, the scheduling strategy of a dispatch vehicle, the maximum carriage capacity of a dispatch vehicle, the variable for the number of shared bicycles a dispatch vehicle picks up at its starting unit and places at its arrival unit, the ratio α_wh of the number of shared bicycles placed at η_{i,1} to the number of bicycles in the carriage when the dispatch vehicle arrives at a unit belonging to η_w, the predicted cumulative increase/decrease of the supply amount of unit η_5 before dispatch vehicle i implements its scheduling strategy, under the expectation that the preceding dispatch vehicles have implemented theirs, the increased revenue after a dispatch vehicle implements its scheduling strategy, and the total amount Z_warehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling cycle;
4. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S3, the vehicle dispatching optimization model of the shared bicycles is specifically:
s.t.
In the vehicle dispatching optimization model, the increased revenue generated after the dispatch vehicles implement their scheduling strategies is maximized as the objective function of the short-term scheduling optimization problem of shared bicycles, wherein t represents a time step, T_max the maximum time step variable, i the dispatch vehicle label variable, N the maximum value of the dispatch vehicle label variable, and the scheduling strategies of the dispatch vehicles are the decision variables;
When the policy execution state variable tr of time step t equals 0, the action decision is computed from the moving direction variable of the dispatch vehicle toward the six adjacent regular hexagons and the dispatch ratio variable of the dispatch vehicle;
When the policy execution state variable tr of time step t equals 0 and the global label variable η_5 of the dispatching area unit is the same as the global label η_2 of the unit containing the OD origin of the shared bicycle trip, the shared bicycle path flow with OD label variable (η_2, η_3) is computed, wherein INT(·) denotes rounding down to an integer; the calculation uses the shared bicycle travel demand variable of the dispatching area unit, the shared bicycle supply variable under the initial given supply at t = 0, the shared bicycle supply variable of the dispatching area unit when the policy execution state variable tr equals 1, and the travel flow of shared bicycles departing from η_2 and arriving at η_3; M′ represents the unit label set of the dispatching area units;
The originating trip flow of the global label η2 of the scheduling area unit where the OD origin of the shared bicycle trip lies is 1 and is calculated by the corresponding formula, wherein T represents the set of time step variables and η3 represents the global label of the unit where the OD destination of the shared bicycle trip lies;
According to the path flow, when the strategy execution state tr equals 0 at time step t and the global label variable η5 of the scheduling area unit is the same as the global label η2 of the unit where the OD origin of the shared bicycle trip lies, the sum of the shared bicycle path flows is taken as the actual trip generation of shared bicycles of the scheduling area unit with global label variable η5;
When the strategy execution state variable tr of the time step variable t equals 0 and the global label variable η5 of the scheduling area unit is the same as the global label η3 of the unit where the OD destination of the shared bicycle trip lies, the sum of the shared bicycle path flows is taken as the actual trip attraction of shared bicycles of the scheduling area unit with global label variable η5;
When the strategy execution state variable tr of the time step variable t equals 0, the shared bicycle supply is updated by the numbers of shared bicycles rented and parked during riders' trip activities, wherein the inputs of the update formula are the shared bicycle supply variable after the scheduling strategy has been implemented at time step (t-1) with strategy execution state variable tr equal to 1, the actual trip generation variable of shared bicycles of η5 at time step t, and the actual trip attraction variable of shared bicycles of η5 at time step t;
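The supply update just described can be sketched in Python. The function and argument names are illustrative, since the patent's formula images are not reproduced in this text; the balance (previous supply, minus bikes ridden out, plus bikes ridden in, floored at zero) is one plausible reading.

```python
def update_supply(prev_supply, generated, attracted):
    """Update the shared-bicycle supply of one dispatch-area unit.

    prev_supply: supply after the previous dispatch step (tr = 1)
    generated:   bikes that riders rented and rode out of the unit
    attracted:   bikes that riders rode into and parked in the unit
    (names are illustrative; the exact formula is an image missing
    from this text)
    """
    return max(prev_supply - generated + attracted, 0)
```

A unit that started with 10 bikes, lost 3 to departing riders and gained 5 from arriving riders would end the step with 12.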
When the strategy execution state variable tr of the time step variable t equals 0, the unit label variable that the dispatch vehicle will reach at the (t+1) time step is calculated by the corresponding formula, wherein m denotes the horizontal direction label variable of the scheduling area unit, h denotes the vertical direction label variable of the scheduling area unit, and the inputs are the starting unit label variable of the dispatch vehicle at the (t+1) time step and the moving-direction variable of the dispatch vehicle from ηi,0 toward the six adjacent regular hexagons;
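Movement toward the six adjacent regular hexagons is conveniently expressed with axial grid coordinates. The offset table below follows a common hexagonal-grid convention and is an assumption; the patent's own (m, h) formula is an image missing from this text.

```python
# Axial-coordinate offsets for the six neighbours of a regular hexagon
# (a common convention; the patent's exact indexing is not reproduced here).
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def next_unit(m, h, direction):
    """Return the (m, h) label of the scheduling area unit a dispatch
    vehicle reaches when moving in hexagonal direction 0-5; an extra
    action index 6 is assumed to keep the vehicle in place."""
    if direction == 6:
        return (m, h)
    dm, dh = HEX_DIRECTIONS[direction]
    return (m + dm, h + dh)
```

For example, a vehicle in unit (2, 3) moving in direction 0 reaches (3, 3), while direction 6 leaves it where it is.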
When the strategy execution state variable tr of the time step variable t equals 0, the predicted cumulative increase/decrease of the supply of η5 is calculated by the corresponding formula, wherein one input is the number of shared bicycles that the (i-1)-th dispatch vehicle is predicted to pick up from η5, αwh represents the ratio of the number of shared bicycles put down by the dispatch vehicle to the number of bicycles in its cabin when the dispatch vehicle arrives at ηi,1 and ηi,1 belongs to ηw, and ηw represents the fixed warehouse location set;
When the strategy execution state variable tr equals 0 at time step t, the dispatch vehicle picks up shared bicycles from ηi,0 into its cabin and puts them all down at ηi,1; the number of bicycles picked up by the dispatch vehicle is calculated by the corresponding formula, wherein min(·) denotes taking the minimum value, and the inputs are the supply when the strategy execution state variable tr equals 0, the starting unit label variable ηi,0 of the dispatch vehicle, the maximum cabin capacity of the dispatch vehicle, and the dispatch ratio variable of the dispatch vehicle;
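The min(·) expression in this claim can be sketched as follows. Since the formula image is missing, the exact combination is an assumption: the pickup is bounded both by the unit's current supply and by the dispatch ratio applied to the cabin capacity.

```python
def bikes_to_pick(supply, max_capacity, ratio):
    """Number of shared bicycles a dispatch vehicle picks up at its
    starting unit: capped by the unit's current supply and by the
    fraction `ratio` of the truck's cabin capacity (a plausible
    reading of the claim; the formula image is missing)."""
    return min(supply, int(ratio * max_capacity))
```

With 50 bikes in the unit, a 20-bike cabin and a dispatch ratio of 0.5, the vehicle would load 10 bikes; with only 3 bikes present it would load all 3.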
When the strategy execution state variable tr equals 1 at time step t, the scheduling strategy is executed according to the number of bicycles picked up by the dispatch vehicle, and the supply of η5 is updated to obtain the shared bicycle supply variable of η5 after the scheduling strategy is implemented;
The total number of shared bicycles Zwarehouse stored in the city's fixed warehouses is calculated by the corresponding formula.
5. The deep reinforcement learning-based shared bicycle scheduling method according to claim 1, wherein the step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle dispatching framework based on the vehicle dispatching optimization model of the shared bicycles;
S42: determining the average action by using one-hot encoding;
S43: defining the experience pool variables and training-round related variables of the shared bicycle dispatching framework;
S44: constructing the shared bicycle dispatching framework from the elements, the average action, the experience pool variables and the training-round related variables based on mean field theory.
6. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in step S41 the elements of the shared bicycle dispatching framework comprise a state, a behavior parameter at and a reward function, wherein the state represents the state of the dispatch vehicle at the time step variable t, and the behavior parameter represents the scheduling strategy of the dispatch vehicle at the time step variable t;
The reward function comprises the actual increased-trip reward function of the dispatch vehicle, the average increased-trip reward function of the dispatch vehicle, and the overall increased-trip reward function of the dispatch vehicles; the specific formulas are as follows:
wherein αrw represents the scaling factor of the reward function, the remaining inputs are the actual trip amounts with and without implementing the scheduling strategy and the number of dispatch vehicles within the time step variable t, including the actual trip amounts of η5 with and without implementing the scheduling strategy, N represents the maximum value of the dispatch vehicle label variable, η5 represents the global label of each scheduling area unit, and M′ represents the unit label set of the scheduling area units.
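Although the reward formula images are missing, the claim's structure (scaled difference between trips realized with and without the dispatch strategy) admits a minimal sketch. The function name and the exact difference form are assumptions.

```python
def vehicle_reward(trips_with_policy, trips_without_policy, alpha_rw):
    """Illustrative reward: alpha_rw times the increase in realized
    trips attributable to the dispatch action. alpha_rw is the
    patent's reward scaling factor; the precise formula is an image
    missing from this text."""
    return alpha_rw * (trips_with_policy - trips_without_policy)
```

A unit realizing 12 trips under dispatching versus 10 without, with αrw = 0.5, would yield a reward of 1.0.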
7. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in step S42 the specific method for determining the average action is as follows: the scheduling strategy of the dispatch vehicle is rewritten by one-hot encoding to obtain the average action; the calculation formulas are respectively as follows:
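The mean-field average action described in this claim is commonly computed by one-hot encoding each other vehicle's discrete action and averaging the vectors; the sketch below follows that standard construction (function names are illustrative).

```python
def one_hot(index, size):
    """One-hot encode a discrete action index as a length-`size` vector."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def mean_action(neighbour_actions, action_space_size):
    """Mean-field average action: one-hot encode every other dispatch
    vehicle's action and average the resulting vectors element-wise."""
    encoded = [one_hot(a, action_space_size) for a in neighbour_actions]
    n = len(encoded)
    return [sum(col) / n for col in zip(*encoded)]
```

For three neighbours taking actions 0, 1 and 1 in a 3-action space, the average action is (1/3, 2/3, 0): a probability-like summary of what the other agents did.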
8. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in step S43 the experience pool variables of the shared bicycle dispatching framework comprise an experience pool and an experience pool capacity;
The training-round related variables comprise the number of training rounds Episode, the target network update round number Episodeupnet, the target network update weight coefficient ω, and the cumulative return discount factor γ.
9. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein the step S44 comprises the following sub-steps:
S441: initializing the experience pool, setting the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient αrw, the cumulative return discount factor γ, the initially given supply, the shared bicycle travel demand variable, and the trip flow of shared bicycles departing from η2 and arriving at η3, and cyclically performing steps S442 to S445 based on the number of training rounds Episode;
S442: updating the shared bicycle operating environment when the strategy execution state variable tr equals 0;
S444: updating the shared bicycle operating environment when the strategy execution state variable tr equals 1;
S445: updating the state and the average action of the next time step for each dispatch vehicle, and updating the increased revenue of the dispatch vehicle after implementing the scheduling strategy according to the reward function;
S446: based on the updating processes of steps S442 to S445, constructing the shared bicycle dispatching framework by using a reinforcement learning algorithm, and completing shared bicycle dispatching by using the shared bicycle dispatching framework.
10. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in step S442 the specific method for updating the shared bicycle operating environment when the strategy execution state variable tr equals 0 is as follows: updating and calculating the shared bicycle path flow of the trip with OD label variable (η2, η3), the actual trip generation of shared bicycles of η5, the actual trip attraction of shared bicycles of η5, and the supply when the strategy execution state variable tr equals 0;
In step S444, the specific method for updating the shared bicycle operating environment when the strategy execution state variable tr equals 1 is as follows: updating and calculating the behavior parameter at, the variable for the number of shared bicycles picked up from ηi,0 and put down on arrival at ηi,1 by the dispatch vehicle, and the supply when the strategy execution state variable tr equals 1;
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
Each dispatch vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatch vehicle comprises a policy estimation network and a policy target network: the policy estimation network is constructed as a neural network with parameters θi, whose input is the state of the dispatch vehicle and whose output is the scheduling strategy; the policy target network is constructed as a neural network whose input is the state of the dispatch vehicle at the next time step variable and whose output is the scheduling strategy of the next time step variable;
In both the policy gradient method and the Q-Learning method, the value model of each dispatch vehicle comprises a value estimation network and a value target network. The value estimation network is constructed as a neural network whose inputs are the state of the dispatch vehicle, the scheduling strategy and the average action, and whose output is the Q-value function Qi, where the Q-value function is the state-action value function in the reinforcement learning algorithm and represents the cumulative reward value obtained by the dispatch vehicle. The value target network is constructed as a neural network whose inputs are the state, the scheduling strategy and the average action of the dispatch vehicle at the next time step variable, and whose output is the target Q-value function;
In the Q-Learning method, the policy model of the dispatch vehicle obtains its action by probability sampling and selection according to the corresponding formulas, wherein one input is the average action at time step t-1 of the dispatch vehicles other than dispatch vehicle i (labelled ine), ωd represents the policy parameters, one term is the functional form of Qi, one term represents the action probability calculation function, and Ai represents the action space set; the average action is updated by substitution into the formula, an action is obtained by probability sampling, and that action is taken as the final selected action of the policy model of the dispatch vehicle;
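The probability sampling over Q-values described here is typically realized as a Boltzmann (softmax) policy in mean-field Q-learning; the sketch below assumes that form, with an injectable random source for reproducibility (names are illustrative).

```python
import math
import random

def boltzmann_policy(q_values, temperature=1.0, rng=random.random):
    """Sample an action index with probability proportional to
    exp(Q / temperature), a common realization of the probability
    sampling step in mean-field Q-learning (assumed form; the
    patent's exact probability function is an image missing here)."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng(), 0.0
    for action, p in enumerate(probs):
        cum += p
        if r < cum:
            return action
    return len(probs) - 1  # guard against floating-point rounding
```

With two equal Q-values the two actions are equally likely; higher Q-values receive exponentially more probability mass as the temperature falls.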
If the reinforcement learning algorithm adopts the policy gradient method, the transition is stored into the experience pool, a batch of samples is randomly sampled from the experience pool, the neural network parameters of the value estimation network are updated from the samples based on the cumulative return discount factor γ and a loss function, and the neural network parameters θi of the policy model are updated by the gradient descent method. Every Episodeupnet training rounds, the policy model neural network parameters θi and the value model neural network parameters are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and of the value target network respectively, wherein st+1 represents the global state of the next time step, rti represents the reward value of the dispatch vehicle, st,j represents the global state in the sample, and st+1,j represents the global state of the next time step in the sample, together with the strategy, the average action and the reward value of the dispatch vehicle in the sample;
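The periodic transfer of estimation-network parameters into target networks, weighted by ω, can be sketched as a soft update; representing the parameters as flat lists is a simplification (in practice they are tensors), and the blend form is the standard convention assumed here.

```python
def update_target(estimate_params, target_params, omega):
    """Soft target-network update: blend estimation-network weights
    into the target network with update weight coefficient omega
    (omega = 1 copies them outright). Flat lists stand in for the
    actual network tensors."""
    return [omega * e + (1.0 - omega) * t
            for e, t in zip(estimate_params, target_params)]
```

With ω = 0.5 the target parameters move halfway toward the estimation parameters each time, which keeps the bootstrapped targets slow-moving and stabilizes training.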
If the reinforcement learning algorithm adopts the Q-Learning method, the transition is stored into the experience pool, a batch of samples is again randomly sampled from the experience pool, and the neural network parameters of the value estimation network are updated from the samples based on the cumulative return discount factor γ and a loss function. Every Episodeupnet training rounds, the value model neural network parameters are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the value target network.
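The experience-pool mechanics shared by both branches (bounded storage, uniform random minibatch sampling, one-step discounted target with γ) can be sketched as follows; the class and function names are illustrative, not the patent's.

```python
import random

class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random minibatch
    sampling; the oldest transition is evicted when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []

    def store(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # evict the oldest transition
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

def td_target(reward, gamma, next_q):
    """One-step discounted target y = r + gamma * Q'(next state),
    the quantity the value-estimation loss is computed against."""
    return reward + gamma * next_q
```

A buffer of capacity 2 receiving three transitions retains the two most recent; a reward of 1.0 with γ = 0.9 and a next-state target Q of 2.0 yields a TD target of 2.8.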
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110421814 | 2021-04-20 | ||
CN2021104218142 | 2021-04-20 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113326993A true CN113326993A (en) | 2021-08-31 |
CN113326993B CN113326993B (en) | 2023-06-09 |
Family
ID=77425362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110744265.2A Active CN113326993B (en) | 2021-04-20 | 2021-06-30 | Shared bicycle scheduling method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113326993B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582469A (en) * | 2020-03-23 | 2020-08-25 | 成都信息工程大学 | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal |
CN112068515A (en) * | 2020-08-27 | 2020-12-11 | 宁波工程学院 | Full-automatic parking lot scheduling method based on deep reinforcement learning |
CN112417753A (en) * | 2020-11-04 | 2021-02-26 | 中国科学技术大学 | Urban public transport resource joint scheduling method |
CN112348258A (en) * | 2020-11-09 | 2021-02-09 | 合肥工业大学 | Shared bicycle predictive scheduling method based on deep Q network |
Non-Patent Citations (5)
Title |
---|
IBRAHIM ALTHAMARY等: "A Survey on Multi-Agent Reinforcement Learning Methods for Vehicular Networks", 《2019 15TH INTERNATIONAL WIRELESS COMMUNICATIONS & MOBILE COMPUTING CONFERENCE(IWCMC)》 * |
VAN HASSELT等: "Deep Reinforcement Learning with double Q-Learning", 《30TH ASSOCIATION-FOR-THE-ADVANCEMENT-OF-ARTIFICIAL-INTELLIGENCE(AAAI) CONFERENCE ON ARTIFICIAL INTELLIGENCE》 * |
FAN RUINA: "Mean Field Theory and Closed Queueing Network Research of Bike-Sharing Systems", 《China Doctoral Dissertations Full-text Database, Economics and Management Sciences》 * |
TU WENWEN et al.: "A Deep Learning Model for Traffic Flow State Classification Based on Smart Phone Sensor Data", 《ARXIV PREPRINT ARXIV》 * |
CHEN JIAHUI et al.: "Research on Shared Bicycle Dispatching Route Optimization", 《Transportation Science & Technology and Economy》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113997926A (en) * | 2021-11-30 | 2022-02-01 | 江苏浩峰汽车附件有限公司 | Parallel hybrid electric vehicle energy management method based on layered reinforcement learning |
CN115796399A (en) * | 2023-02-06 | 2023-03-14 | 佰聆数据股份有限公司 | Intelligent scheduling method, device and equipment based on electric power materials and storage medium |
CN116307251A (en) * | 2023-04-12 | 2023-06-23 | 哈尔滨理工大学 | Work schedule optimization method based on reinforcement learning |
CN116307251B (en) * | 2023-04-12 | 2023-09-19 | 哈尔滨理工大学 | Work schedule optimization method based on reinforcement learning |
CN116402323A (en) * | 2023-06-09 | 2023-07-07 | 华东交通大学 | Taxi scheduling method |
CN116402323B (en) * | 2023-06-09 | 2023-09-01 | 华东交通大学 | Taxi scheduling method |
CN116824861A (en) * | 2023-08-24 | 2023-09-29 | 北京亦庄智能城市研究院集团有限公司 | Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform |
CN116824861B (en) * | 2023-08-24 | 2023-12-05 | 北京亦庄智能城市研究院集团有限公司 | Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform |
Also Published As
Publication number | Publication date |
---|---|
CN113326993B (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113326993A (en) | Shared bicycle scheduling method based on deep reinforcement learning | |
CN111862579B (en) | Taxi scheduling method and system based on deep reinforcement learning | |
CN108417031B (en) | Intelligent parking berth reservation strategy optimization method based on Agent simulation | |
CN113222463B (en) | Data-driven neural network agent-assisted strip mine unmanned truck scheduling method | |
CN112738752A (en) | WRSN multi-mobile charger optimized scheduling method based on reinforcement learning | |
CN116227773A (en) | Distribution path optimization method based on ant colony algorithm | |
Wang et al. | Optimization of ride-sharing with passenger transfer via deep reinforcement learning | |
Xu et al. | Designing van-based mobile battery swapping and rebalancing services for dockless ebike-sharing systems based on the dueling double deep Q-network | |
CN104537446A (en) | Bilevel vehicle routing optimization method with fuzzy random time window | |
Kiaee | Integration of electric vehicles in smart grid using deep reinforcement learning | |
CN117350424A (en) | Economic dispatching and electric vehicle charging strategy combined optimization method in energy internet | |
CN115759915A (en) | Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning | |
CN117541026B (en) | Intelligent logistics transport vehicle dispatching method and system | |
CN112750298B (en) | Truck formation dynamic resource allocation method based on SMDP and DRL | |
CN114117910A (en) | Electric vehicle charging guide strategy method based on layered deep reinforcement learning | |
CN117592701A (en) | Scenic spot intelligent parking lot management method and system | |
Xu et al. | Research on open-pit mine vehicle scheduling problem with approximate dynamic programming | |
CN116739466A (en) | Distribution center vehicle path planning method based on multi-agent deep reinforcement learning | |
CN117032298A (en) | Unmanned aerial vehicle task allocation planning method under synchronous operation and cooperative distribution mode of truck unmanned aerial vehicle | |
CN115907066A (en) | Cement enterprise vehicle scheduling method based on hybrid sparrow intelligent optimization algorithm | |
CN115187056A (en) | Multi-agent cooperative resource allocation method considering fairness principle | |
CN114611864A (en) | Garbage vehicle low-carbon scheduling method and system | |
Dziubany et al. | Optimization of a cpss-based flexible transportation system | |
CN112561104A (en) | Vehicle sharing service order dispatching method and system based on reinforcement learning | |
CN111652550A (en) | Method, system and equipment for intelligently searching optimal loop set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||