CN113326993A - Shared bicycle scheduling method based on deep reinforcement learning - Google Patents

Shared bicycle scheduling method based on deep reinforcement learning Download PDF

Info

Publication number
CN113326993A
CN113326993A (application CN202110744265.2A)
Authority
CN
China
Prior art keywords
variable
dispatching
scheduling
vehicle
shared bicycle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110744265.2A
Other languages
Chinese (zh)
Other versions
CN113326993B (en)
Inventor
肖峰 (Xiao Feng)
涂雯雯 (Tu Wenwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwestern University Of Finance And Economics
Original Assignee
Southwestern University Of Finance And Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwestern University Of Finance And Economics filed Critical Southwestern University Of Finance And Economics
Publication of CN113326993A publication Critical patent/CN113326993A/en
Application granted granted Critical
Publication of CN113326993B publication Critical patent/CN113326993B/en
Current legal status: Active

Classifications

    • G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F30/15 — Computer-aided design [CAD]; Geometric CAD; Vehicle, aircraft or watercraft design
    • G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM]
    • G06N3/04 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
    • G06N3/084 — Neural network learning methods; Backpropagation, e.g. using gradient descent
    • G06Q10/06315 — Needs-based resource requirements planning or analysis
    • G06Q50/40 — Business processes related to the transportation industry
    • G06F2111/04 — Constraint-based CAD
    • G06F2111/08 — Probabilistic or stochastic CAD
    • G06F2119/12 — Timing analysis or timing optimisation
    • Y02T10/40 — Engine management systems (climate change mitigation technologies related to road transport)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)

Abstract

The invention discloses a shared bicycle scheduling method based on deep reinforcement learning, which comprises the following steps. S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles. S2: determining the scheduling variables of the shared bicycles. S3: constructing a vehicle dispatching optimization model for the shared bicycles. S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the framework. The reinforcement-learning-based shared bicycle scheduling optimization method provided by the invention helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on large-scale road networks under random and complex dynamic environments. The method takes into account future changes in supply and demand and the interaction between scheduling decisions and the environment; it requires neither advance demand forecasting nor manual data processing, and is not limited by the computational efficiency or accuracy of demand prediction.

Description

Shared bicycle scheduling method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In previous research, the shared bicycle scheduling optimization problem is usually solved by dividing the scheduling horizon into separate time periods and searching for the optimal scheduling strategy independently within each period. However, the scheduling strategy of one time period affects the supply and demand environment of the next and later periods. An isolated, period-by-period optimization method does not take into account the supply and demand conditions of future periods or the consequences of the strategy that has been implemented. As a result, the strategy that is optimal for the current period does not necessarily increase the actual trip volume in future periods, and may even reduce it. Therefore, a period-by-period isolated optimization method does not necessarily yield an optimal global strategy over the full scheduling horizon.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem over long scheduling horizons, in dynamic environments and on large-scale networks, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows: a shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle dispatching optimization model for the shared bicycles according to the scheduling variables;
S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
Further, in step S1, the specific method for dividing the scheduling area of the shared bicycles is as follows: the scheduling area of the shared bicycles is divided into a plurality of regular hexagons serving as scheduling area units, and for each scheduling area unit a global label variable η5, a horizontal-direction label variable m and a vertical-direction label variable h are defined, which satisfy the following relational expressions:
η5(m,h) = (M+1)m + h
m ∈ {0,1,...,M}
h ∈ {0,1,...,M}
wherein η5 ∈ M′, M′ = {0,1,...,(M+1)²−1}, M denotes the maximum value of the horizontal-direction or vertical-direction label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units.
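For illustration only, a minimal sketch of the mapping between the (m, h) coordinates and the global unit label η5 defined above might look as follows; the function names are illustrative and not part of the invention.

# Illustrative sketch of the hexagonal-unit labelling described above.
# Assumes M is the maximum horizontal/vertical label (labels run 0..M).

def unit_label(m: int, h: int, M: int) -> int:
    """Global label eta5 of the scheduling area unit at (m, h)."""
    assert 0 <= m <= M and 0 <= h <= M
    return (M + 1) * m + h

def unit_coords(eta5: int, M: int) -> tuple[int, int]:
    """Inverse mapping: recover (m, h) from the global label eta5."""
    assert 0 <= eta5 <= (M + 1) ** 2 - 1
    return divmod(eta5, M + 1)

# Example: with M = 9 (a 10 x 10 grid of hexagonal units),
# unit (m=3, h=7) has global label 37 and unit_coords(37, 9) == (3, 7).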
In step S1, the operating environment variables of the shared bicycles include time variables and an urban fixed-warehouse location set variable;
the time variables comprise a time step variable t, a time step set T and a maximum time step variable Tmax, where t ∈ T and T = {0,1,...,Tmax};
the urban fixed-warehouse location set variable comprises the fixed warehouse location set ηw.
Further, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling strategy variable class;
the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of the scheduling area unit when tr = 1;
at time step t, the riding trip variable class comprises the global label η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global label η3 of the scheduling area unit where the OD destination of the trip is located, the OD label variable (η2, η3) of a shared bicycle trip, the OD flow of shared bicycle trips, the trip flow rate of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip generation variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
at time step t, the scheduling strategy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the label variable ηi,0 of the unit from which a dispatching vehicle departs, the label variable ηi,1 of the unit at which a dispatching vehicle arrives, the set κ1 of moving-direction variables of a dispatching vehicle, the set κ2 of scheduling ratio variables, the moving-direction variable of a dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling strategy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable giving the number of shared bicycles that a dispatching vehicle picks up at ηi,0 and places at ηi,1, the ratio αwh of the number of shared bicycles placed at ηi,1 to the number of bicycles in the cabin when the dispatching vehicle arrives at ηi,1 and ηi,1 belongs to ηw, the predicted cumulative increase or decrease of the supply of unit η5 before a dispatching vehicle implements its scheduling strategy, under the expectation that the preceding dispatching vehicles implement theirs, the increased revenue obtained after a dispatching vehicle implements its scheduling strategy, and the total number Zwarehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling cycle;
where I = {0,1,...,N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0,1,...,5} and κ2 = {0, 0.25, 0.5, 0.75}; the moving-direction variable takes values in κ1 and the scheduling ratio variable takes values in κ2.
Further, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle dispatching optimization model of the shared bicycles;
S42: determining the mean action using one-hot encoding;
S43: defining the experience pool variables and the training-round related variables of the shared bicycle scheduling framework;
S44: constructing the shared bicycle scheduling framework from its elements, the mean action, the experience pool variables and the training-round related variables, based on mean field theory.
Further, in step S41, the elements of the shared bicycle scheduling framework include the state, the action parameter at and the reward functions, where the state of dispatching vehicle i, i = 0,...,N, at time step variable t describes the situation of the dispatching vehicle at that time step, and the action of dispatching vehicle i at time step variable t is the scheduling strategy of the dispatching vehicle at that time step;
the reward function comprises a reward function for actually increasing the traffic of the dispatching vehicle
Figure BDA0003142305910000035
Average increase traffic rewarding function of dispatching vehicle
Figure BDA0003142305910000036
And dispatching vehicle overall increase go function
Figure BDA0003142305910000037
The concrete formula is as follows:
Figure BDA0003142305910000038
Figure BDA0003142305910000039
Figure BDA00031423059100000310
wherein ,αrwThe scaling factor of the reward function is represented,
Figure BDA00031423059100000311
when indicating implementation of scheduling policy
Figure BDA00031423059100000312
The actual amount of the business trip of (c),
Figure BDA00031423059100000313
indicating when no scheduling policy is implemented
Figure BDA00031423059100000314
The actual amount of the business trip of (c),
Figure BDA00031423059100000315
when indicating implementation of scheduling policy
Figure BDA00031423059100000316
The actual amount of the business trip of (c),
Figure BDA00031423059100000317
when indicating not to implement scheduling policy
Figure BDA00031423059100000318
The actual amount of the business trip of (c),
Figure BDA00031423059100000319
representing the time step variable tth
Figure BDA00031423059100000320
The number of vehicles to be scheduled in the interior,
Figure BDA00031423059100000321
representing the time step variable tth
Figure BDA00031423059100000322
The number of vehicles to be scheduled in the interior,
Figure BDA00031423059100000323
indicating when scheduling policy is implemented η5The actual amount of the business trip of (c),
Figure BDA00031423059100000324
indicating when scheduling policy is not implemented η5N represents the maximum value of the tag variable of the dispatching car, η5A global label representing each scheduling region unit, and M' represents a unit label set of the scheduling region unit.
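The exact reward expressions are given only as images in the original filing; the following sketch merely illustrates the general structure described above (a scaled difference between trip volumes with and without the scheduling strategy), and the precise combination of terms as well as all names are assumptions.

# Hedged sketch of the reward structure described above: rewards are scaled
# differences between actual trip volumes with and without the scheduling
# strategy. The exact formulas appear as images in the original filing, so the
# combination of terms below is an assumption for illustration only.

def vehicle_reward(alpha_rw, trips_with, trips_without, origin, dest):
    """Actual increased-trip reward of one dispatching vehicle."""
    gain_dest = trips_with[dest] - trips_without[dest]
    gain_origin = trips_with[origin] - trips_without[origin]
    return alpha_rw * (gain_dest + gain_origin)

def average_vehicle_reward(alpha_rw, trips_with, trips_without,
                           origin, dest, n_vehicles_origin, n_vehicles_dest):
    """Average increased-trip reward: the same gains shared among the
    dispatching vehicles present in the origin and destination units."""
    gain_dest = (trips_with[dest] - trips_without[dest]) / max(n_vehicles_dest, 1)
    gain_origin = (trips_with[origin] - trips_without[origin]) / max(n_vehicles_origin, 1)
    return alpha_rw * (gain_dest + gain_origin)

def overall_reward(alpha_rw, trips_with, trips_without, unit_labels):
    """Overall increased-trip reward summed over every scheduling area unit."""
    return alpha_rw * sum(trips_with[u] - trips_without[u] for u in unit_labels)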
Further, in step S42, the specific method for determining the mean action is: the scheduling strategy of each dispatching vehicle is rewritten in one-hot encoded form, and the mean action of dispatching vehicle i is obtained by averaging the one-hot encoded action strategies of the other dispatching vehicles. In the corresponding calculation formulas, the one-hot components take the value 0 or 1, pdim denotes the dimension of the scheduling strategy, N denotes the maximum value of the dispatching vehicle label variable, and ine denotes the label variable of a dispatching vehicle other than dispatching vehicle i, whose action strategy enters the average.
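A minimal sketch of the one-hot mean-action computation described above might read as follows; the discrete action space of size pdim and the helper names are assumptions for illustration.

import numpy as np

# Illustrative sketch: one-hot encode each other vehicle's discrete scheduling
# action and average them to obtain the mean action seen by vehicle i.
# p_dim is the dimension of the (discretised) scheduling strategy.

def one_hot(action_index: int, p_dim: int) -> np.ndarray:
    v = np.zeros(p_dim)
    v[action_index] = 1.0
    return v

def mean_action(actions: dict[int, int], i: int, p_dim: int) -> np.ndarray:
    """Average of the one-hot actions of all dispatching vehicles other than i."""
    others = [one_hot(a, p_dim) for j, a in actions.items() if j != i]
    if not others:
        return np.zeros(p_dim)
    return np.mean(others, axis=0)

# Example: three vehicles, 24 possible (direction, ratio) combinations:
# actions = {0: 5, 1: 17, 2: 5}; mean_action(actions, 0, 24)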
Further, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round related variables comprise the number of training rounds Episode, the network-update interval Episodeupnet expressed in training rounds, the target network update weight coefficient ω and the cumulative reward discount factor γ.
Further, step S44 comprises the following sub-steps (a skeleton of the resulting loop is sketched after this list):
S441: initializing the experience pool; setting the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient αrw, the cumulative reward discount factor γ, the initial given supply, the shared bicycle travel demand variables and the trip flow rates of shared bicycles departing from η2 and arriving at η3; and cyclically performing steps S442-S445 over the training rounds;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling strategy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the state and the mean action of each dispatching vehicle for the next time step, and updating, according to the reward function, the increased revenue obtained after the dispatching vehicle implements its scheduling strategy;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework with a reinforcement learning algorithm, and completing shared bicycle scheduling with the scheduling framework.
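A highly simplified skeleton of the loop in steps S441-S446, assuming a generic actor-critic style update, might look as follows; all class, function and parameter names are illustrative placeholders rather than the invention's actual implementation.

import random
from collections import deque

# Skeleton of the training loop described in steps S441-S446 (illustrative;
# the environment and agent internals are placeholders).

def train(env, agents, episodes, episode_upnet, pool_capacity, gamma, omega):
    pool = deque(maxlen=pool_capacity)           # experience pool (S441)
    for episode in range(episodes):
        states = env.reset()                     # initial supply, demand, flow rates
        for t in range(env.t_max):
            env.step_riding(t)                   # S442: riders rent/park, tr = 0
            actions = [ag.act(states[i]) for i, ag in enumerate(agents)]  # S443
            env.step_dispatch(t, actions)        # S444: implement strategies, tr = 1
            next_states, mean_acts, rewards = env.observe(t + 1)          # S445
            for i in range(len(agents)):
                pool.append((states[i], actions[i], rewards[i],
                             next_states[i], mean_acts[i]))
            states = next_states
        if pool:                                 # S446: sample and update networks
            batch = random.sample(list(pool), min(256, len(pool)))
            for ag in agents:
                ag.update(batch, gamma)
        if episode % episode_upnet == 0:
            for ag in agents:
                ag.soft_update_targets(omega)    # copy weights to the target networks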
Further, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: updating and calculating the shared bicycle path flow of each OD label variable (η2, η3), the actual shared bicycle trip generation variable of unit η5, the actual shared bicycle trip attraction variable of unit η5, and the supply when the policy execution state variable tr = 0.
In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: updating and calculating the action parameter at, the variable giving the number of shared bicycles that each dispatching vehicle picks up at ηi,0 and places upon reaching ηi,1, and the supply when the policy execution state variable tr = 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.
Each dispatching vehicle is associated with a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state of the dispatching vehicle and whose output is its scheduling strategy; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step variable and whose output is the scheduling strategy of the next time step variable.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a parameterized neural network whose inputs are the state of the dispatching vehicle, its scheduling strategy and the mean action, and whose output is the Q-value function Qi; here the Q-value function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a parameterized neural network whose inputs are the state of the dispatching vehicle at the next time step variable, the scheduling strategy of the next time step variable and the mean action of the next time step variable, and whose output is the target Q-value function.
In the Q-Learning method, the policy model of a dispatching vehicle selects its action by probability sampling: the selection probabilities are computed from the Q-value function Qi evaluated with the state of the dispatching vehicle and the mean action of the other dispatching vehicles at time step t-1, where ine denotes the label variable of a dispatching vehicle other than dispatching vehicle i, ωd denotes the policy parameter, and Ai denotes the action space set of the dispatching vehicle. The mean action is then updated and substituted back into the sampling formula, an action is drawn according to the resulting probabilities, and this action is taken as the action finally selected by the policy model of the dispatching vehicle.
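A minimal sketch of such probability sampling over Q values, under the assumption of a discrete action set and a Boltzmann (softmax) form with a temperature-like policy parameter, might read:

import numpy as np

# Illustrative Boltzmann (softmax) action selection over Q values, as used in
# mean-field Q-learning style methods; beta plays the role of the policy
# parameter controlling exploration (an assumption for illustration).

def boltzmann_sample(q_values: np.ndarray, beta: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    logits = beta * q_values
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

# q_values[a] would be Q_i(state, a, mean action of the other vehicles at the
# previous time step), evaluated for every action a in the action set A_i.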
If the reinforcement learning algorithm adopts the policy gradient method, the tuple formed by the global state, the scheduling strategies, the reward values of the dispatching vehicles, the mean actions and the global state of the next time step is stored in the experience pool, and a batch of samples is randomly drawn from the experience pool. According to the sampled global states st,j, the sampled global states of the next time step st+1,j, the sampled strategies, the sampled mean actions, the sampled reward values of the dispatching vehicles, the cumulative reward discount factor γ and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episodeupnet training rounds, the parameters θi of the policy model neural network and the parameters of the value model neural network are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and of the value target network, respectively. Here st denotes the global state and st+1 the global state of the next time step.
If the reinforcement learning algorithm adopts the Q-Learning method, the corresponding tuple is stored in the experience pool and a batch of samples is again randomly drawn from the experience pool. According to the sampled data, the cumulative reward discount factor γ and the loss function, the neural network parameters of the value estimation network are updated; every Episodeupnet training rounds, the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the target network update weight coefficient ω.
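The periodic transfer of estimation-network parameters to the target networks with weight coefficient ω is commonly implemented as a soft (Polyak) update; the following sketch assumes that convention, which the patent text does not spell out.

import torch

# Illustrative soft update of a target network with weight coefficient omega:
# target <- omega * estimation + (1 - omega) * target.
# Whether a soft or a hard copy is intended is not stated in the text; the soft
# form is assumed here for illustration.

@torch.no_grad()
def soft_update(target_net, estimation_net, omega: float) -> None:
    for tp, ep in zip(target_net.parameters(), estimation_net.parameters()):
        tp.mul_(1.0 - omega).add_(omega * ep)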
The invention has the following beneficial effects:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method provided by the invention helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on large-scale road networks under random and complex dynamic environments. The method requires neither advance demand forecasting nor manual data processing, and is not limited by the computational efficiency or accuracy of demand prediction. Moreover, it is not an optimal strategy for each time period taken in isolation, but an overall optimization of the whole scheduling process that takes into account the supply and demand changes of future time periods and the influence of scheduling decisions on the supply and demand of subsequent periods.
(2) The dynamic scheduling optimization strategy provided by the invention improves scheduling operation efficiency: it increases the actual trip volume and the utilization rate of shared bicycles and reduces the loss of shared bicycle user demand; it lowers the idle rate of shared bicycles on the roads and reduces the excessive accumulation of idle vehicles in certain areas; and it reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large piles of idle vehicles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of bicycles in feeder (first/last-mile) trips and improves the operating efficiency of the public transport system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, which reduces urban congestion and motor vehicle exhaust emissions and increases social welfare.
Drawings
FIG. 1 is a flow chart of the shared bicycle scheduling method;
FIG. 2 is a coordinate diagram of the region units based on division into regular hexagons.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
Before describing specific embodiments of the present invention, in order to make the solution of the invention clearer and more complete, the definitions of the abbreviations and key terms appearing in the invention are explained first:
OD traffic volume: the volume of travel between origin-destination pairs. "O" comes from the English word ORIGIN and refers to the starting point of a trip; "D" comes from DESTINATION and refers to the destination of a trip.
MFMARL algorithm: Mean Field Multi-Agent Reinforcement Learning, a multi-agent reinforcement learning algorithm based on mean field game theory.
As shown in FIG. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle dispatching optimization model for the shared bicycles according to the scheduling variables;
S4: constructing a shared bicycle scheduling framework from the vehicle dispatching optimization model using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
In the embodiment of the invention, the interaction between the supply and demand environment and the implemented scheduling strategy is considered within a sequential decision problem, and a dynamic scheduling optimization problem for shared bicycles is formulated. According to the length of the scheduling optimization cycle and whether surplus shared bicycles may be placed into urban fixed warehouses, the scheduling optimization problem can be divided into two problems: the shared bicycle scheduling optimization problem without fixed warehouses and the shared bicycle scheduling optimization problem with fixed warehouses.
In the shared bicycle scheduling optimization problem, the optimization objective is not to maximize the actual trip volume within a single time period or to pursue high scheduling efficiency for a single dispatching vehicle, but to maximize the global trip volume over the whole scheduling cycle through cooperative dynamic scheduling strategy optimization. In addition, when urban warehouses exist, the scheduling strategy includes the action of placing redundant vehicles into a warehouse so as to reduce the number of redundant bicycles on the roads.
The invention constructs a scheduling optimization process for shared bicycles as shown in fig. 3. In the dynamic scheduling optimization process, the renting, riding, parking and scheduling of bicycles and the changes in supply and demand are considered. At each time step, each dispatching vehicle picks up a certain number of shared bicycles from its current unit and loads them into its cabin; the dispatching vehicle then drives to the arrival unit and places all the shared bicycles in its cabin in that unit.
In the embodiment of the present invention, as shown in FIG. 2, the specific method for dividing the scheduling area of the shared bicycles is: the scheduling area is divided into a plurality of regular hexagons serving as scheduling area units, and for each scheduling area unit a global label variable η5, a horizontal-direction label variable m and a vertical-direction label variable h are defined, which satisfy the following relational expressions:
η5(m,h) = (M+1)m + h
m ∈ {0,1,...,M}
h ∈ {0,1,...,M}
wherein η5 ∈ M′, M′ = {0,1,...,(M+1)²−1}, M denotes the maximum value of the horizontal-direction or vertical-direction label variable of a scheduling area unit, and M′ denotes the unit label set of the scheduling area units;
in step S1, the operating environment variables of the shared bicycles include time variables, an urban fixed-warehouse location set variable and supply parameters;
the time variables comprise a time step variable t, a time step set T and a maximum time step variable Tmax, where t ∈ T and T = {0,1,...,Tmax};
the urban fixed-warehouse location set variable comprises the fixed warehouse location set ηw.
Some units in the city may be set as urban fixed warehouses for scheduling; when scheduling measures are implemented, a dispatching vehicle may place idle shared bicycles into the urban fixed warehouse of such a unit. An urban fixed warehouse has no capacity limit, and bicycles placed into a warehouse by a dispatching vehicle are not transported out again or given to riders for use. When the unit at which a rider's trip ends is an urban fixed warehouse unit, the shared bicycle parked by the rider is not placed into the warehouse but remains in the unit and can still be used by riders in the future. The urban fixed-warehouse location set variable contains the locations of the units in which urban fixed warehouses are located within the whole region.
The supply parameters comprise a first supply coefficient cdis and a second supply coefficient cinitial. The first supply coefficient cdis is determined as follows: the demand value of each scheduling area unit at each time step is calculated from the shared bicycle demand data, and the 40th percentile of the demand values of all scheduling area units in every 10 minutes is taken as the first supply coefficient cdis. The second supply coefficient cinitial is determined as the ratio between the shared bicycle supply of each scheduling area unit and the first supply coefficient cdis.
It is assumed that at the initial time the shared bicycle supply of each unit in the area is evenly distributed. In order to give the study of the influence of supply on travel some generality, the invention does not specify the supply directly but determines the supply value from the relation between supply and demand. The invention defines the first supply coefficient cdis as the 40th percentile of the sequence of riding demand values of all units in every 10 min, where the demand value of each unit at each time step is calculated from the demand data. The 40th percentile is used instead of the mean because the mean is more easily affected by extreme values: the 40th percentile is the value located at the 40% position after all values are arranged from small to large. This avoids the problem that a small number of very high demands in the riding demand sequence of all units at each time step would make the analysis lose generality.
The invention defines the second supply coefficient cinitial as the ratio, at the initial time, between the shared bicycle supply of each unit and the first supply coefficient cdis. In the invention, the second supply coefficient cinitial is chosen from the values cinitial ∈ {20, 50, 100, 200, 500, 1000}. The shared bicycle supply of each unit at the initial time is defined as the product of cdis and cinitial, rounded down to an integer.
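A minimal sketch of this supply initialisation, assuming demand data aggregated into 10-minute time steps per unit, might read:

import math
import numpy as np

# Illustrative initialisation of the supply following the description above:
# c_dis is the 40th percentile of per-unit, per-10-minute demand values, and the
# initial supply of every unit is floor(c_dis * c_initial).

def first_supply_coefficient(demand: np.ndarray) -> float:
    """demand: array of shape (n_time_steps, n_units) with 10-minute demand values."""
    return float(np.percentile(demand, 40))

def initial_supply(demand: np.ndarray, c_initial: float) -> int:
    c_dis = first_supply_coefficient(demand)
    return math.floor(c_dis * c_initial)

# Example: for c_initial in {20, 50, 100, 200, 500, 1000}, each unit starts the
# scheduling cycle with initial_supply(demand, c_initial) shared bicycles.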
In the embodiment of the present invention, in step S2, the scheduling variables of the shared bicycles include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling strategy variable class;
the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0 (indicating the number of shared bicycles available for use), and the shared bicycle supply variable of the scheduling area unit when tr = 1 (likewise indicating the number of shared bicycles available for use);
at time step t, the riding trip variable class comprises the global label η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global label η3 of the scheduling area unit where the OD destination of the trip is located, the OD label variable (η2, η3) of a shared bicycle trip, the OD flow of shared bicycle trips, the trip flow rate of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip generation variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
the conversion of η2 and η3 to horizontal and vertical labels follows the same label relation as for the scheduling area units. For a given origin unit η2, the trip flow rates to all destination units η3 sum to 1. When η2 = η5, the path flows with unit η2 as origin unit sum to the actual trip generation of η5; when η3 = η5, the path flows with unit η3 as destination unit sum to the actual trip attraction of η5;
at time step t, the scheduling strategy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the label variable ηi,0 of the unit from which a dispatching vehicle departs, the label variable ηi,1 of the unit at which a dispatching vehicle arrives, the set κ1 of moving-direction variables of a dispatching vehicle, the set κ2 of scheduling ratio variables, the moving-direction variable of a dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling strategy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable giving the number of shared bicycles that a dispatching vehicle picks up at ηi,0 and places at ηi,1, the ratio αwh of the number of shared bicycles placed at ηi,1 to the number of bicycles in the cabin when the dispatching vehicle arrives at ηi,1 and ηi,1 belongs to ηw, the predicted cumulative increase or decrease of the supply of unit η5 before a dispatching vehicle implements its scheduling strategy, under the expectation that the preceding dispatching vehicles implement theirs, the increased revenue obtained after a dispatching vehicle implements its scheduling strategy, and the total number Zwarehouse of shared bicycles stored in the urban fixed warehouses at the end of the scheduling cycle;
where I = {0,1,...,N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0,1,...,5} and κ2 = {0, 0.25, 0.5, 0.75}; the moving-direction variable takes values in κ1 and the scheduling ratio variable takes values in κ2.
When the moving-direction variable of a dispatching vehicle takes the values 0 to 5, the dispatching vehicle moves to the adjacent unit to the lower left, to the right, to the upper left, to the left, to the lower right and to the upper right, respectively; the corresponding relations between the moving-direction variable and the horizontal and vertical labels of the arrival unit follow from the hexagonal layout of the units. In these relations, m denotes the horizontal-direction label variable of a scheduling area unit, h denotes the vertical-direction label variable of a scheduling area unit, M′ denotes the unit label set of the scheduling area units and T denotes the time step set.
The conversion of ηi,0 and ηi,1 to horizontal and vertical labels follows the same label relation as for the scheduling area units. The scheduling ratio variable of dispatching vehicle i can take four percentage values and indicates the percentage of the supply of unit ηi,0 that dispatching vehicle i picks up at that moment. When the number of shared bicycles expected to be cumulatively removed from unit η5 exceeds the number expected to be placed there, the predicted cumulative increase or decrease of its supply takes a negative value; conversely, when the number expected to be removed is less than or equal to the number placed, the predicted cumulative increase or decrease is non-negative.
In the embodiment of the present invention, in step S3, the vehicle dispatching optimization model of the shared bicycles consists of an objective function subject to a set of constraints; the objective and constraint expressions are given as equation images in the original publication, and their content is described below.
In the vehicle dispatching optimization model, the increased revenue obtained after the dispatching vehicles implement their scheduling strategies is maximized as the objective function of the short-term scheduling optimization problem of the shared bicycles over the whole scheduling horizon, where t denotes a time step, Tmax denotes the maximum time step variable, i denotes the dispatching vehicle label variable and N denotes the total number of dispatching vehicle label variables.
Compared with the case where no scheduling strategy is implemented, the invention sets the benefit as the maximized gain of the shared bicycle system. The decision variables are the action decisions of the dispatching vehicles, which comprise the moving direction of the dispatching vehicle and its scheduling ratio. When the policy execution state variable tr of time step variable t equals 0, the decision variable of each dispatching vehicle i ∈ I is its action decision, i.e. the pair formed by the moving-direction variable of the dispatching vehicle from ηi,0 towards one of the six adjacent regular hexagons and the scheduling ratio variable of the dispatching vehicle.
When the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a scheduling area unit coincides with the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the shared bicycle path flow of the OD label variable (η2, η3) is updated and calculated for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′, where INT(·) denotes rounding down to an integer, and the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit (the initial given supply at t = 0, or the supply after the scheduling strategy was implemented, tr = 1) and the trip flow rate of shared bicycles departing from η2 and arriving at η3 enter the calculation; M′ denotes the unit label set of the scheduling area units.
The actual trip volume of the scheduling area unit for the policy execution state tr = 0 is the smaller of the supply of the unit and the travel demand generated by the unit. The path flow is not greater than the integer part of the product of the actual trip volume of the referenced scheduling area unit η5 and the trip flow rate with η2 as origin and η3 as destination.
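A minimal sketch of this trip assignment step, under the assumption that the trips of an origin unit are split over destinations according to the given OD trip flow rates, might read:

import math

# Illustrative trip assignment for one origin unit when tr = 0: the actual trip
# generation is limited by the available supply, and trips are split over
# destinations according to the OD trip flow rates (which sum to 1).

def assign_trips(supply: int, demand: int, flow_rates: dict[int, float]) -> dict[int, int]:
    """Return the path flow to each destination unit from this origin unit."""
    actual_generation = min(supply, demand)
    return {dest: math.floor(actual_generation * rate)
            for dest, rate in flow_rates.items()}

# Example: supply = 12, demand = 20, flow_rates = {3: 0.5, 7: 0.3, 9: 0.2}
# -> actual generation 12, path flows {3: 6, 7: 3, 9: 2}.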
The trip flow rate of shared bicycles departing from η2 and arriving at η3 satisfies the conservation relation between path flow and OD flow: for the unit η2 where the OD origin of the shared bicycle trips is located, the trip flow rates of the departing trips sum to 1, for tr = 0, η2 ∈ M′, η3 ∈ M′, where T denotes the time step set and η3 denotes the global label of the unit where the OD destination of a shared bicycle trip is located.
According to the path flows, when the policy execution state tr = 0 at time step t and the global label variable η5 of a scheduling area unit coincides with the global label η2 of the unit where the OD origin of a shared bicycle trip is located, the sum of the shared bicycle path flows is taken as the actual shared bicycle trip generation of the scheduling area unit with global label η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′.
When the policy execution state variable tr of time step variable t equals 0 and the global label variable η5 of a scheduling area unit coincides with the global label η3 of the unit where the OD destination of a shared bicycle trip is located, the sum of the shared bicycle path flows of the OD label variables (η2, η3) is taken as the actual shared bicycle trip attraction of the scheduling area unit with global label η5, for tr = 0, η5 ∈ M′, η2 ∈ M′, η3 ∈ M′.
When the policy execution state variable tr of time step variable t equals 0, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' trip activities: the supply of each unit is obtained from the shared bicycle supply variable after the scheduling strategy was implemented at time step (t-1) (i.e. with tr = 1), decreased by the actual shared bicycle trip generation of η5 at time step t and increased by the actual shared bicycle trip attraction of η5 at time step t.
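Stated as code, and assuming the signs follow the verbal description (rented bicycles leave the unit, parked bicycles arrive), the rider-trip update of one unit's supply might look like:

# Illustrative supply update for one unit when tr = 0 (signs assumed from the
# verbal description: rented bicycles reduce the supply, parked bicycles add to it).

def supply_after_trips(supply_prev_tr1: int, generation: int, attraction: int) -> int:
    """supply_prev_tr1: supply after dispatching at time step t-1 (tr = 1);
    generation: actual shared bicycle trips generated by the unit at time step t;
    attraction: actual shared bicycle trips attracted by the unit at time step t."""
    return supply_prev_tr1 - generation + attraction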
When the policy execution state variable tr of time step variable t equals 0, the label variable of the unit that the dispatching vehicle reaches at time step (t+1) is calculated from the label variable of its current departure unit ηi,0 and its moving-direction variable towards one of the six adjacent regular hexagons, for tr = 0, i ∈ I, ηi,0 ∈ M′, ηi,1 ∈ M′, where m denotes the horizontal-direction label variable of a scheduling area unit and h denotes the vertical-direction label variable of a scheduling area unit; the departure-unit label variable of the dispatching vehicle at time step (t+1) is the arrival-unit label variable determined at time step t.
When the policy execution state variable tr of time step variable t equals 0, the predicted cumulative increase or decrease of the supply of unit η5 is calculated from the numbers of shared bicycles that the preceding dispatching vehicles are predicted to pick up from and to place into η5, where αwh denotes the ratio applied when a dispatching vehicle arrives at ηi,1 and ηi,1 belongs to the fixed warehouse location set ηw.
When the policy execution state tr = 0 at time step t, after the first (i-1) dispatching vehicles are predicted to have implemented their scheduling strategies, the predicted cumulative increase or decrease of the supply of unit η5 is obtained as follows. If the scheduling strategy that the (i-1)-th dispatching vehicle is predicted to implement involves unit η5 neither as pick-up unit nor as placement unit, the predicted cumulative increase or decrease remains unchanged, and it is 0 if no preceding dispatching vehicle involves η5. If the (i-1)-th dispatching vehicle is predicted to pick up a certain number of vehicles from unit η5, the predicted cumulative increase or decrease of the supply of η5 is reduced by that number. If the (i-1)-th dispatching vehicle is predicted to place a certain number of vehicles into unit η5 and the label of unit η5 does not belong to the urban fixed warehouse location set ηw, the predicted cumulative increase or decrease of the supply of η5 is increased by that number. If the (i-1)-th dispatching vehicle is predicted to place vehicles into unit η5 and the label of unit η5 belongs to the urban fixed warehouse location set ηw, the predicted cumulative increase or decrease of the supply of η5 is increased by the number determined with the ratio αwh. When the urban fixed warehouse location set ηw is empty, the case of urban fixed warehouses is by default not considered.
When the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up a certain number of shared bicycles from ηi,0, loads them into its cabin, and will place all the shared bicycles in its cabin at ηi,1. The number of vehicles picked up by the dispatching vehicle is calculated for tr = 0, i ∈ I, η5 ∈ M′ from the supply when the policy execution state variable tr = 0, the label variable ηi,0 of the departure unit of the dispatching vehicle, the maximum cabin capacity of the dispatching vehicle and its scheduling ratio variable, where min(·) denotes taking the minimum value.
The scheduling ratio prescribes the percentage of the current supply of unit ηi,0 that the dispatching vehicle requests to pick up according to its scheduling strategy. The remaining supply is the supply of ηi,0 in the time state tr = 0 after the preceding (i-1) dispatching vehicles are assumed to have performed their scheduling before time step t. The number of shared bicycles picked up is the minimum of the number to be picked up according to the scheduling strategy, the remaining supply and the cabin capacity, taken as an integer, and the whole expression is subject to a non-negativity constraint.
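As a sketch under the stated assumptions (requested number = scheduling ratio × remaining supply, capped by the remaining supply and the cabin capacity, and forced to be a non-negative integer):

import math

# Illustrative computation of the number of shared bicycles a dispatching
# vehicle picks up, following the verbal description above.

def pickup_count(remaining_supply: int, ratio: float, cabin_capacity: int) -> int:
    """ratio in {0, 0.25, 0.5, 0.75}: share of the unit's supply requested."""
    requested = math.floor(ratio * remaining_supply)
    return max(0, min(requested, remaining_supply, cabin_capacity))

# Example: remaining_supply = 30, ratio = 0.5, cabin_capacity = 12 -> 12 picked up.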
When the policy execution state variable tr = 1 at time step t, the scheduling strategy is executed according to the numbers of vehicles picked up by the dispatching vehicles, and the supply of unit η5 is updated to obtain the shared bicycle supply of η5 after the scheduling strategy has been implemented.
At time step t in the time state tr = 1, after dispatching vehicle i has performed its scheduling, the supply of unit η5 is obtained as follows. If the scheduling strategy implemented by dispatching vehicle i involves unit η5 neither as pick-up unit nor as placement unit, the supply of η5 remains unchanged. If dispatching vehicle i picks up a certain number of vehicles from unit η5, the supply of η5 is reduced by that number. If dispatching vehicle i places a certain number of vehicles into unit η5 and the label of unit η5 does not belong to the urban fixed warehouse location set ηw, the supply of η5 is increased by that number. Otherwise, when the label of unit η5 belongs to the urban fixed warehouse location set ηw, the supply of η5 is increased by the number of shared bicycles placed according to the ratio αwh, and the remaining shared bicycles in the cabin are by default placed into the urban fixed warehouse of unit η5.
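A hedged sketch of this dispatch-side update for a single unit, assuming the warehouse split described above (a fraction αwh of the cabin load is placed on the road and the remainder is stored in the warehouse), might read:

# Illustrative dispatch-side supply update for one unit when tr = 1.
# Assumption: when the arrival unit hosts an urban fixed warehouse, a fraction
# alpha_wh of the cabin load is placed on the road and the rest is stored in
# the warehouse (this split is described only verbally in the text above).

def supply_after_dispatch(supply, picked_here, placed_here, is_warehouse_unit,
                          alpha_wh, warehouse_stock):
    supply -= picked_here                       # bicycles picked up from this unit
    if is_warehouse_unit:
        on_road = int(alpha_wh * placed_here)   # part of the cabin goes on the road
        warehouse_stock += placed_here - on_road
        supply += on_road
    else:
        supply += placed_here                   # all placed bicycles stay on the road
    return supply, warehouse_stock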
The total number of shared bicycles Z_warehouse stored in the city fixed warehouses is calculated by accumulating, over the scheduling cycle, the bicycles deposited into warehouse units, i.e. Z_warehouse = Σ_{t∈T} Σ_{i∈I} (1 − α_wh) · u_{i,t} for those time steps at which η_{i,1} ∈ η_w.
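A minimal sketch of this supply update in Python; the argument names (unit labels, pick-up count, warehouse set, split ratio alpha_wh) are hypothetical stand-ins for the variables above:

def apply_dispatch(supply: dict, unit: int, start_unit: int, arrival_unit: int,
                   picked_up: int, warehouse_units: set, alpha_wh: float,
                   warehouse_total: int) -> tuple:
    """Update the supply of one unit after a dispatching vehicle executes its policy (tr = 1)."""
    q = supply[unit]
    if unit == start_unit:
        q -= picked_up                                   # bicycles removed from the starting unit
    elif unit == arrival_unit:
        if unit in warehouse_units:
            kept = int(alpha_wh * picked_up)
            q += kept                                    # part of the load goes back on the street
            warehouse_total += picked_up - kept          # the rest is stored in the city warehouse
        else:
            q += picked_up                               # whole load placed at the arrival unit
    supply[unit] = q
    return supply, warehouse_total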
The shared bicycle scheduling optimization problem rests on two assumptions. Assumption 1: each dispatching vehicle implements its scheduling policy in turn, in the order of its label. Riders rent shared bicycles according to the current supply in the current unit; the decision maker formulates a scheduling policy based on the supply-demand environment after those trips have finished, then implements the policy and updates the supply-demand environment. The policy execution state variable tr therefore splits every time step into two states: a state in which rider trips are updated and the scheduling policy is formulated, and a state in which the scheduling policy is implemented. When tr = 0, riders rent, use and park shared bicycles, the supply-demand change of each unit at time step t is updated, and a scheduling policy is generated based on the supply-demand situation after the trips; when tr = 1, the scheduling policy is implemented and the supply-demand environment is updated under its influence. Assumption 2: to ensure that a dispatching vehicle never travels outside the scheduling area, the invention assumes that the vehicle stays at its current location whenever this would otherwise occur. That is, when the scheduling policy would move the dispatching vehicle outside the area at time step t + 1, the policy is updated so that the unit η_{i,1} reached at time step t + 1 is the unit η_{i,0} in which the vehicle is located at time step t. Under this assumption, under policy a_{i,t}, the starting unit η_{i,0} and the arrival unit η_{i,1} must simultaneously satisfy η_{i,0} ∈ M′ and η_{i,1} ∈ M′, with tr = 0 and i ∈ I.
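A minimal sketch of this boundary rule on the hexagonal grid; the six direction offsets below are hypothetical (they depend on how the hexagons are laid out), and the point of the sketch is only the stay-in-place rule of Assumption 2:

def next_unit(m: int, h: int, direction: int, M: int) -> tuple:
    """Move one step on the hexagonal grid; stay in place if the move would leave the area."""
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1)]  # hypothetical layout
    dm, dh = offsets[direction]
    nm, nh = m + dm, h + dh
    if not (0 <= nm <= M and 0 <= nh <= M):
        return m, h          # Assumption 2: the dispatching vehicle stays at its current unit
    return nm, nh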
In the shared bicycle scheduling optimization problem, the length of the scheduling cycle is controlled by the setting of the time step set. For the short-term scheduling optimization problem of shared bicycles, the scheduling cycle is set to one day, i.e. Tmax = 143 and the time set is T = {0, 1, ..., 143}. In conventional scheduling methods the scheduling cycle is usually one day. In practice, however, because the number of dispatching vehicles is limited, the uneven distribution of shared bicycles and the resulting demand loss become more serious as time goes on. Especially in the later stages of operation, the distribution of shared bicycles is even more unbalanced, which makes formulating an effective policy more challenging. To address long-term operational scheduling, the invention also considers a long-cycle dynamic scheduling optimization problem for shared bicycles, whose scheduling cycle is defined as 7 days, i.e. Tmax = 1007 and the time set is T = {0, 1, ..., 1007}.
The invention further aims to reduce the number of shared bicycles left excessively idle on urban roads while still increasing ridership as much as possible. It therefore also formulates a dynamic scheduling optimization problem for shared bicycles that includes city warehouses. In this problem, the invention assumes that fixed warehouses exist in the city and that redundant bicycles can be stored in them. During a dispatch operation, the dispatching vehicle may move to a unit in which a warehouse is built and place part of the bicycles stored in its cabin into the city warehouse. When the city fixed-warehouse location set η_w is empty, the shared bicycle scheduling optimization problem does not consider city warehouses; that is, redundant shared bicycles cannot be deposited into a city fixed warehouse during the scheduling process. Conversely, when η_w is not empty, the shared bicycle scheduling optimization problem by default assumes that city fixed warehouses exist and are used to store excess idle shared bicycles.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0,1}, the dispatching vehicle label set I = {0, 1, ..., N}, the dispatching vehicle moving direction variable set κ1 = {0, 1, ..., 5}, the dispatch ratio variable set κ2 = {0, 0.25, 0.5, 0.75}, and the unit label set M′ = {0, 1, ..., ((M+1)² − 1)}. According to the variable definitions of the short-term shared bicycle scheduling optimization problem, and taking the implementation of the scheduling policy into account, the constraints on the actual trip production and actual attraction of each unit ensure conservation of the shared bicycle trip flows of every unit.
In the constructed short-term scheduling optimization problem for shared bicycles, the objective function is to maximize the total increase in shared bicycle trips in the area served by the dispatching vehicles, compared with the case in which no scheduling policy is implemented. The decision variables are the action decisions of the dispatching vehicles, namely the direction in which each vehicle moves to a unit and the number of bicycles it dispatches. The constraints are conservation of the total number of shared bicycles, consistency between riding path flows and riding OD flows, and non-negativity and integrality of the flows during scheduling. When the travel demand generated in a unit exceeds the shared bicycles available there, the excess demand is counted as lost demand.
In the embodiment of the present invention, step S4 includes the following sub-steps:
s41: determining elements of a shared bicycle dispatching frame based on a vehicle dispatching optimization model of the shared bicycle;
s42: determining average action by utilizing a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle dispatching frame;
s44: and constructing the shared bicycle dispatching frame according to the elements, the average action, the experience pool variable and the training round related variable of the shared bicycle dispatching frame based on an average field theory.
Based on the proposed shared bicycle scheduling optimization problem, the invention provides a multi-agent reinforcement learning shared bicycle dispatching framework based on mean field theory. Its aim is to let the agents learn the changing riding demand, adapt to a stochastic dynamic environment, and realize cooperative dynamic decision optimization that increases riding output.
In the embodiment of the invention, in step S41, the invention combines the shared bicycle transfer process model with a multi-agent reinforcement learning algorithm to construct the shared bicycle vehicle scheduling model. The invention defines I as the agent label set, equal to the label set of the dispatching vehicles, S as the state set, A_i as the action space of agent i, P as the transition probability function, R as the reward function and γ as the discount factor. The MDP-based reinforcement learning model then contains six elements: G = (I, S, A, P, R, γ), where i denotes a dispatching vehicle label variable, equivalent to the label variable of an agent in the reinforcement learning algorithm, and i ∈ I = {0, 1, ..., N}.
The elements of the shared bicycle dispatching framework include the state s_{i,t}, the behavior parameter a_t and the reward function, where s_{i,t}, i = 0, ..., N, denotes the state of dispatching vehicle i at time step variable t, and a_{i,t}, i = 0, ..., N, denotes the scheduling policy of dispatching vehicle i at time step variable t.

The reward function comprises the actually increased trip reward function r^{PA}_{i,t} of a dispatching vehicle, the average increased trip reward function r^{APA}_{i,t} of a dispatching vehicle, and the overall increased trip reward function r^{APTU}_{t} of the dispatching vehicles, with the following formulas:

r^{PA}_{i,t} = α_rw · [ (P^{w}_{η_{i,0},t} − P^{wo}_{η_{i,0},t}) + (P^{w}_{η_{i,1},t} − P^{wo}_{η_{i,1},t}) ],

r^{APA}_{i,t} = α_rw · [ (P^{w}_{η_{i,0},t} − P^{wo}_{η_{i,0},t}) / n_{η_{i,0},t} + (P^{w}_{η_{i,1},t} − P^{wo}_{η_{i,1},t}) / n_{η_{i,1},t} ],

r^{APTU}_{t} = α_rw · Σ_{η5∈M′} (P^{w}_{η5,t} − P^{wo}_{η5,t}),

where α_rw denotes the scaling coefficient of the reward function; P^{w}_{η_{i,0},t} and P^{w}_{η_{i,1},t} denote the actual trip production of units η_{i,0} and η_{i,1} when the scheduling policy is implemented, and P^{wo}_{η_{i,0},t} and P^{wo}_{η_{i,1},t} the corresponding actual trip production when no scheduling policy is implemented; n_{η_{i,0},t} and n_{η_{i,1},t} denote the number of dispatching vehicles inside units η_{i,0} and η_{i,1} at time step variable t; P^{w}_{η5,t} and P^{wo}_{η5,t} denote the actual trip production of unit η5 with and without the scheduling policy; N denotes the maximum value of the dispatching vehicle label variable; η5 denotes the global label of each scheduling area unit; and M′ denotes the unit label set of the scheduling area units.
In the state s_{i,t}, i = 0, ..., N denotes the state of the dispatching vehicle at time step variable t. The invention assumes that the state at time t contains the supply of the unit in which agent i is located and the position label of that unit, i.e. s_{i,t} = (q^{tr=0}_{η_{i,0},t}, η_{i,0}). The behavior parameter refers to the joint action formed by the scheduling policies of the dispatching vehicles at time t and satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, where A is the set of vectors formed by the action spaces A_i of the actions a_{i,t}. The action policy a_{i,t} of agent i is equal to the scheduling policy of dispatching vehicle i, i.e. a_{i,t} = (κ¹_{i,t}, κ²_{i,t}), and agent i refers to each dispatching vehicle label in the city. The reward r_{i,t} is the immediate evaluation given by the environment, during the interaction between agent i and the environment, of the state and of the action generated. The goal of agent i is to find the maximum reward r_{i,t}. The invention considers three forms of reward function.
The reward function is the immediate evaluation given by the environment, during the interaction between agent i and the environment, of the state and of the generated action; the goal of agent i is to obtain the maximum reward value. Based on the shared bicycle scheduling problem, the variables used in computing the reward function are defined as follows:

α_rw — reward function scaling coefficient, dimensionless;
P^{wo}_{η_{i,0},t} — actual trip production of unit η_{i,0} at time step t when no scheduling policy is implemented, dimensionless;
P^{wo}_{η_{i,1},t} — actual trip production of unit η_{i,1} at time step t when no scheduling policy is implemented, dimensionless;
n_{η_{i,0},t} — number of dispatching vehicles inside unit η_{i,0} at time step t, dimensionless;
n_{η_{i,1},t} — number of dispatching vehicles inside unit η_{i,1} at time step t, dimensionless.

In the shared bicycle scheduling problem, the invention considers reward functions of three forms and takes the reward r_{i,t} of agent i as one selectable reward function among them, i.e. r_{i,t} ∈ {r^{PA}_{i,t}, r^{APA}_{i,t}, r^{APTU}_{t}}.
(1) Increased trip amount reward function obtained by the agent: the invention defines an increased trip amount (PA) reward function obtained by an agent, referred to as the PA reward function and denoted r^{PA}_{i,t}. It represents the increase in shared bicycle trips obtained by each agent after performing an action. In the PA reward function, the reward in the units the agent has moved through is considered to be the reward earned by that agent. This setting may cause an agent to focus only on the scheduling of certain units.
(2) Average increased trip amount reward function obtained by the agent: the invention defines an average increased trip amount (APA) reward function obtained by an agent, referred to as the APA reward function and denoted r^{APA}_{i,t}. The APA reward function refers to the average increase in shared bicycle trips obtained by an agent after performing an action; r^{APA}_{i,t} is defined as the average increase in trip production of the units η_{i,0} and η_{i,1} through which the dispatching vehicle executes its scheduling policy.
(3) Global increased trip amount obtained by the agents: the invention defines a globally increased trip amount reward function obtained by the agents over all units (APTU reward function), denoted r^{APTU}_{t}. The APTU reward function refers to the total increase in shared bicycle trips over the whole area obtained by all agents after performing the joint action.
The state transition probability refers to the updating of each agent's state as the time step advances, based on the joint action performed by the agents and their interaction with the environment.
In the embodiment of the present invention, in step S42, the specific method for determining the mean action is as follows: the scheduling policy a_{i,t} of a dispatching vehicle is rewritten in one-hot encoded form and the mean action \bar{a}_{i,t} is obtained. The calculation formulas are

a^{one-hot}_{i,t} ∈ {0,1}^{ρ_dim},   \bar{a}_{i,t} = (1/N) · Σ_{i_ne ≠ i} a^{one-hot}_{i_ne,t},

where each component of a^{one-hot}_{i,t} is a variable equal to 0 or 1, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle label variable, a^{one-hot}_{i_ne,t} denotes the action policy of i_ne, and i_ne denotes the label variable of the dispatching vehicles other than dispatching vehicle i.

The invention rewrites, in one-hot encoded form, the action vector a_{i,t} containing the moving direction κ¹_{i,t} and the dispatch ratio κ²_{i,t}; \bar{a}_{i,t} is then the mean action seen by agent i when the actions of the remaining agents are taken into account.
In a conventional multi-agent deep reinforcement learning algorithm, the joint action a_t satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, so the dimension of a_t is (N + 1) · ρ_dim; it expands as the number of agents increases, which leads to problems such as greater network complexity, lower computational efficiency and weaker policy optimization in reinforcement learning. In the MFMARL algorithm, by contrast, the joint mean action \bar{a}_{i,t} has dimension ρ_dim, the individual action a_{i,t} has dimension ρ_dim, and the action pair (a_{i,t}, \bar{a}_{i,t}) used in place of the joint action has dimension 2 · ρ_dim. Processing the joint action in the mean field (MF) manner therefore keeps the dimension of the joint action under control and preserves computational efficiency. In particular, in complex simulation settings where the number of agents is typically large, adopting the joint mean action based on MF theory alleviates the dimension explosion of the joint action caused by the growing number of agents.
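A minimal Python sketch of the one-hot rewriting and the mean action, assuming the six moving directions and four dispatch ratios defined earlier (the names are hypothetical):

import numpy as np

DIRECTIONS = list(range(6))                 # kappa1 = {0, ..., 5}
RATIOS = [0.0, 0.25, 0.5, 0.75]             # kappa2
RHO_DIM = len(DIRECTIONS) + len(RATIOS)     # dimension of one scheduling policy

def one_hot_action(direction: int, ratio: float) -> np.ndarray:
    """One-hot encoding of one dispatching vehicle's scheduling policy (direction, ratio)."""
    vec = np.zeros(RHO_DIM)
    vec[direction] = 1.0
    vec[len(DIRECTIONS) + RATIOS.index(ratio)] = 1.0
    return vec

def mean_action(actions: list, i: int) -> np.ndarray:
    """Mean action seen by agent i: average of the other agents' one-hot actions."""
    others = [a for j, a in enumerate(actions) if j != i]
    return np.mean(others, axis=0) if others else np.zeros(RHO_DIM)

# e.g. three vehicles; the mean action for vehicle 0 averages vehicles 1 and 2
acts = [one_hot_action(0, 0.5), one_hot_action(3, 0.25), one_hot_action(5, 0.0)]
print(mean_action(acts, 0))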
In an embodiment of the present invention, in step S43, the experience pool variables of the shared bicycle dispatching framework include the experience pool D and the experience pool capacity; the training-round related variables include the number of training rounds Episode, the target-network update interval Episode_upnet (in training rounds), the target network update weight coefficient ω and the cumulative return discount factor γ.
In the embodiment of the present invention, step S44 includes the following sub-steps:

S441: initialize the experience pool D; set the experience pool capacity, the target network update weight coefficient ω, the reward function scaling coefficient α_rw, the cumulative return discount factor γ, the initially given supply q^{tr=0}_{η5,0}, the shared bicycle travel demand variable d_{η5,t} and the shared bicycle trip flow f_{η2,η3,t} departing from η2 and arriving at η3; then cyclically perform steps S442–S445 according to the number of training rounds and the update interval Episode_upnet;

S442: update the shared bicycle operating environment when the policy execution state variable tr is 0;

S443: update the state s_{i,t} and the scheduling policy a_{i,t} of each dispatching vehicle;

S444: update the shared bicycle operating environment when the policy execution state variable tr is 1;

S445: update the state s_{i,t+1} of each dispatching vehicle at the next time step and the mean action \bar{a}_{i,t}, and update, according to the reward function, the increased revenue r_{i,t} of each dispatching vehicle after implementing its scheduling policy;

S446: based on the update process of steps S442–S445, construct the shared bicycle dispatching framework with a reinforcement learning algorithm, and complete shared bicycle dispatching with the framework.
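A compact Python sketch of this loop, assuming hypothetical env and agent objects exposing the operations named in S441–S446; it illustrates only the control flow, not the full framework:

def train(env, agents, episodes: int, t_max: int, update_interval: int):
    """Skeleton of the S441-S446 training loop of the dispatching framework."""
    for episode in range(episodes):
        states = env.reset()                                        # S441: initial supply, demand, flows
        for t in range(t_max + 1):
            env.step_riders(t)                                      # S442: rider trips, tr = 0
            actions = [ag.act(s) for ag, s in zip(agents, states)]  # S443: scheduling policies
            env.step_dispatch(actions, t)                           # S444: implement policies, tr = 1
            next_states, rewards = env.observe(t + 1)               # S445: next states and rewards
            for i, ag in enumerate(agents):                         # store transitions, update networks
                ag.remember(states[i], actions[i], rewards[i], next_states[i])
                ag.learn()                                          # S446: update estimation networks
            states = next_states
        if episode % update_interval == 0:                          # periodic target-network update
            for ag in agents:
                ag.update_target_networks()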
Improving the stability of the multi-agent environment: during training in a multi-agent environment, the policy of agent i is constantly changing. For agent i, the continually changing policies of the other agents place it in a non-stationary environment, and the transition probability from the state at one moment to the state at the next can no longer be guaranteed to be a stable value. That is, for any pair of joint policies of the other agents π^{t1}_{−i} ≠ π^{t2}_{−i}, a non-stationary environment may yield

P(s′ | s, a_i, π^{t1}_{−i}) ≠ P(s′ | s, a_i, π^{t2}_{−i}),

where π^{t1}_{−i} and π^{t2}_{−i} denote the policies of the agents other than i at times t1 and t2, and P(· | ·, ·, π^{t1}_{−i}) and P(· | ·, ·, π^{t2}_{−i}) denote the corresponding state transition probabilities faced by agent i at times t1 and t2.

If agent i learns the action contents of all agents during reinforcement learning, the environment it faces can be turned into a stationary one. Since a policy can be expressed through actions and states, when the states and actions of all agents are known, the state transition probabilities of agent i at times t1 and t2 satisfy

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N, π^{t1}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N),
P(s′ | s, a_i, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N).

Therefore P(s′ | s, a_i, π^{t1}_{−i}) and P(s′ | s, a_i, π^{t2}_{−i}) can be regarded as policy-independent. Even while the agents' policies keep changing, the transition probability from a state at one moment to the state at the next retains its stationarity, i.e.

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_i, π^{t2}_{−i}).

Hence, with the joint action known, for arbitrary π^{t1}_{−i} ≠ π^{t2}_{−i} the environment of agent i is improved to a stationary environment, as expressed by

P(s′ | s, a_i, π^{t1}_{−i}) = P(s′ | s, a_i, π^{t2}_{−i}) = P(s′ | s, a_0, a_1, ..., a_N).
In this embodiment of the present invention, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 0 is as follows: update and calculate the shared bicycle path flow f_{η2,η3,t} for each trip OD label variable (η2, η3), the actual shared bicycle trip production P_{η5,t} of unit η5, the actual shared bicycle attraction A_{η5,t} of unit η5, and the supply q^{tr=0}_{η5,t} when the policy execution state variable tr is 0.

In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 1 is as follows: update and calculate, from the behavior parameter a_t, the number u_{i,t} of shared bicycles each dispatching vehicle picks up from η_{i,0} and places on arrival at η_{i,1}, and the supply q^{tr=1}_{η5,t} when the policy execution state variable tr is 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.

Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network: the policy estimation network is a neural network whose input is the state s_{i,t} of the dispatching vehicle and whose output is the scheduling policy a_{i,t}; the policy target network is a neural network whose input is the state s_{i,t+1} of the dispatching vehicle at the next time step variable and whose output is the scheduling policy a_{i,t+1} of the next time step variable.

In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network with parameters φ_i; its inputs are the state s_{i,t}, the scheduling policy a_{i,t} and the mean action \bar{a}_{i,t} of the dispatching vehicle, and its output is the Q-value function Q_i, where the Q-value function refers to the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a neural network with parameters φ′_i; its inputs are the state s_{i,t+1}, the scheduling policy a_{i,t+1} and the mean action \bar{a}_{i,t+1} of the next time step variable, and its output is the target Q-value function Q′_i. The estimation and target networks of the value model compute Q_i and Q′_i by forward propagation and update their parameters by backpropagation, in the same way as the policy model.
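A minimal PyTorch-style sketch of these two networks, assuming hypothetical dimensions (state_dim, rho_dim) and the mean action defined earlier; it illustrates only the input/output wiring, not the patented architecture:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy estimation network: state -> scheduling policy logits (direction + ratio)."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, rho_dim))
    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Value estimation network: (state, action, mean action) -> Q value."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 2 * rho_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, state, action, mean_action):
        x = torch.cat([state, action, mean_action], dim=-1)
        return self.net(x)

# The target networks share the architecture and start from the estimation networks' weights.
policy, policy_target = PolicyNet(8, 10), PolicyNet(8, 10)
policy_target.load_state_dict(policy.state_dict())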
In the Q-Learning method, the policy model of the dispatching vehicle samples and selects the action a_{i,t} according to the probabilities

π_i(a_{i,t} | s_{i,t}, \bar{a}_{i,t−1}) = exp(ω_d · Q_i(s_{i,t}, a_{i,t}, \bar{a}_{i,t−1})) / Σ_{a′∈A_i} exp(ω_d · Q_i(s_{i,t}, a′, \bar{a}_{i,t−1})),

where \bar{a}_{i,t−1} denotes the mean action of the dispatching vehicles i_ne at time step t − 1, i_ne denotes the label variable of the dispatching vehicles other than dispatching vehicle i, ω_d denotes the policy parameter, Q_i(·) denotes the Q-value function, π_i(·) denotes the function computing the action probabilities, and A_i denotes the action space set of a_{i,t}. The mean action \bar{a}_{i,t} is then updated from the sampled actions, substituted back into the formula above, and the action is re-sampled according to π_i(a_{i,t} | s_{i,t}, \bar{a}_{i,t}); this re-sampled action is taken as the final action selected by the policy model of the dispatching vehicle.
For the Q-Learning-based reinforcement learning algorithm, the policy model of each agent obtains the action a_{i,t} from the Q_i value of agent i and does not include a policy target model. The value model of each agent is divided into a value estimation network and a value target network that is structurally identical to the value estimation network. The value estimation network takes as input the global state s_t, the action of the previous time step and the mean action \bar{a}_{i,t−1} of the previous time step, and outputs the Q_i value. The input layer of the value target network takes the global state s_{t+1} of the next moment, the action a_{i,t+1} and the mean action \bar{a}_{i,t+1}, and outputs the Q′_i value. The estimation and target networks of the value model compute Q_i and Q′_i by forward propagation and update their parameters by backpropagation.
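A minimal Python sketch of this softmax (Boltzmann-style) action sampling, assuming a callable q_func(state, action, mean_action) and a coefficient beta as a stand-in for the policy parameter above:

import numpy as np

def boltzmann_probs(q_func, state, actions, mean_action, beta: float) -> np.ndarray:
    """Softmax distribution over a discrete action set, driven by the agent's Q values."""
    q = np.array([q_func(state, a, mean_action) for a in actions])
    q -= q.max()                                  # numerical stabilisation
    p = np.exp(beta * q)
    return p / p.sum()

def sample_action(q_func, state, actions, mean_action, beta, rng):
    idx = rng.choice(len(actions), p=boltzmann_probs(q_func, state, actions, mean_action, beta))
    return actions[idx]

In the two-stage scheme described above, sample_action would be called once with the previous step's mean action, the mean action would be recomputed from all agents' sampled actions, and it would then be called a second time to produce each agent's final action.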
In the framework, each agent independently contains a policy model for decentralized action execution and a value model containing centralized state-action information. The reinforcement learning algorithm can be of two types, namely the policy gradient method and the Q-Learning method.
If the reinforcement learning algorithm adopts the policy gradient method, the transition (s_t, a_{i,t}, \bar{a}_{i,t}, r_{i,t}, s_{t+1}) is stored in the experience pool D, and a batch of samples (s_{t,j}, a_{i,t,j}, \bar{a}_{i,t,j}, r_{i,t,j}, s_{t+1,j}) is randomly drawn from D. According to these samples, the cumulative return discount factor γ and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, according to the target network update weight coefficient ω, the policy model network parameters θ_i and the value model network parameters φ_i are transferred to the corresponding policy target network parameters θ′_i and value target network parameters φ′_i. Here s_t denotes the global state, s_{t+1} the global state of the next time step, r_{i,t} the reward value of the dispatching vehicle, \bar{a}_{i,t} the mean action, s_{t,j} the global state of a sampled sample, s_{t+1,j} the next-step global state of a sampled sample, a_{i,t,j} the policy of a sampled sample, \bar{a}_{i,t,j} the mean action of a sampled sample, and r_{i,t,j} the reward value of the dispatching vehicle in a sampled sample.
If the reinforcement learning algorithm adopts the Q-Learning method, the transition (s_t, a_{i,t}, \bar{a}_{i,t}, r_{i,t}, s_{t+1}) is likewise stored in the experience pool D, and a batch of samples (s_{t,j}, a_{i,t,j}, \bar{a}_{i,t,j}, r_{i,t,j}, s_{t+1,j}) is again randomly drawn from D. For each sample, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, according to the target network update weight coefficient ω, the neural network parameters of the value model are transferred to the neural network parameters of the value target network.
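A small Python sketch of the replay-and-target-update step shared by both variants, assuming a deque-based pool and PyTorch-style networks exposing .parameters(); the soft-update weight omega corresponds to the target network update weight coefficient above:

import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience pool storing (s, a, mean_a, r, s_next) transitions."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)
    def store(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def soft_update(target_net, estimation_net, omega: float):
    """Blend estimation-network weights into the target network with weight omega."""
    for tp, ep in zip(target_net.parameters(), estimation_net.parameters()):
        tp.data.copy_(omega * ep.data + (1.0 - omega) * tp.data)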
In the embodiment of the invention, based on the proposed shared bicycle scheduling optimization problem, a shared bicycle dispatching framework based on mean field multi-agent reinforcement learning is proposed. Its aim is to let the agents learn the changing riding demand, adapt to a stochastic dynamic environment, and realize cooperative dynamic decision optimization that increases riding trips.
The basic idea of framework construction is as follows:
(1) feasibility for solving shared bicycle scheduling problem by using reinforcement learning algorithm framework
In the constructed shared bicycle scheduling optimization problem, the state of the supply amount at the current time is known and no historical information from earlier times is required. That is, the supply state at the current moment depends only on the supply state at the previous moment and the policy action executed, and is independent of the supply states and decision actions at other moments. The supply state over time in the shared bicycle scheduling optimization problem can therefore be regarded as Markovian: the problem satisfies the Markov assumption that the current state contains all relevant information.
Thus, the shared bicycle scheduling problem can be translated into a Markov decision process. The Markov decision process can be solved through the reinforcement learning framework, so that the reinforcement learning framework is feasible to solve the scheduling optimization problem of the shared bicycles. The reinforcement learning does not need label data, and the self-learning of the high-dimensional mapping relation from the state without the model to the action can be realized. In view of the advantages of the reinforcement learning algorithm, the invention provides a shared bicycle dispatching optimization framework based on the reinforcement learning algorithm.
(2) Method for solving shared bicycle scheduling problem based on reinforced algorithm framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, several difficulties arise, such as reduced learning effectiveness and repeated scheduling of vehicles within the same area, as described below.
First, when a multi-agent setting is built on a conventional reinforcement learning algorithm, each agent's policy keeps changing, which produces a non-stationary environment. The non-stationary environment violates the stationarity of Markov state transitions, causes policy estimation errors, and reduces the efficiency of policy optimization or prevents it altogether.
In the DQN algorithm, agent i learns and selects the best policy through an independent Q-learning procedure. In a multi-agent environment, however, agent i updates its policy independently during learning and does not take the other agents' policies into account; that is, the environment that any agent i faces is non-stationary, and at different times t the state transition probability is not necessarily a stable value. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to have a certain stationarity, and a non-stationary environment contradicts this assumption. Secondly, agent i in the conventional DQN algorithm randomly samples data from the experience pool and uses the drawn samples as a training batch for the neural network, which avoids the state-correlation problem in the raw data. In a multi-agent environment, however, a policy that agent i optimized in the current state may be an ineffective policy in the next state of the non-stationary environment. Learning from such invalid policy samples makes the single-agent DQN algorithm an inefficient learning process and increases the likelihood that experience replay fails. Therefore, in a multi-agent environment, the conventional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a policy gradient algorithm operating in a non-stationary environment, the variance produced by the algorithm grows as the number of agents increases, and the probability that the policy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected by the environment and other agents. During the learning process, the dimensionality of the joint action of the state-action value function will expand exponentially as the number of agents increases. Each agent estimates its own value function according to the joint strategy, and when the joint action space is large, the learning efficiency and the learning effect are reduced.
Thirdly, when the estimation network of the reinforcement learning algorithm is not very sensitive to its input data, an agent influenced by a large reward or penalty value learns parameter weights tied to that particular state and action. This further desensitizes the agent to changes in state and action information and leads it to re-select the same action. For a single agent, overly uniform action selection increases the monotony of the learning samples and degrades the fitting accuracy of the neural network in the estimation network. Inaccurate estimation of the dispatching vehicle policy by the reinforcement learning algorithm then reduces dispatching efficiency.
Fourthly, in a multi-agent reinforcement learning algorithm the joint action may send several dispatching vehicles to move shared bicycles within the same unit. Such non-cooperative scheduling policies lead to scheduling inefficiency, and over-scheduling can even leave some units with no bicycles available to borrow or with an excessive accumulation of shared bicycles.
(3) Basic idea of framework construction
According to the existing problems, for the shared bicycle dispatching optimization framework based on multi-agent deep reinforcement learning, the problem of dimension explosion caused by the number of agents in a stable environment needs to be considered, and the problem of efficient cooperative cooperation among the multi-agents is solved. The frame construction idea main points are as follows:
first, the intelligent agent learning structure of the framework:
In distributed-structure multi-agent deep reinforcement learning, methods can be divided into group reinforcement learning and independent reinforcement learning according to whether an agent considers the state and action information of the other agents.

If each agent in a multi-agent system can be regarded as an independent single agent without communication capability, i.e. the agent does not consider the policy choices of other agents when selecting its own policy, the algorithm is called an independent-learning algorithm. In this case, agents can obtain shared information only through collective communication after feedback from the external environment. By contrast, in a group-learning algorithm the agents are treated as a joint group, and each agent also considers the policy choices of the other agents during learning.

The independent-learning approach avoids the dimension explosion in communication caused by a growing number of agents and can borrow reinforcement learning algorithms designed for a stationary environment, but it converges slowly and requires long learning times. The group-learning approach allows full communication and therefore full cooperation among agents, but its search space is large and learning also takes long. In order to realize cooperation and communication among the agents, and to take the other agents' policies into account during learning, a group reinforcement learning approach is adopted here to construct the multi-agent deep reinforcement learning algorithm.
Second, the unstable environment of the frame improves:
for the problem of improving the unstable environment of the multi-agent, if the agent i learns the action contents of all agents in the learning process, the environment where the agent i is located can be changed into the stable environment. The state and action content of the agent is set herein to known information to improve the unstable environment.
Third, the dimension explosion problem in the framework caused by the increased number of agents improves:
Mean field game theory (MFT) studies differential games among group objects composed of rational players. While an agent considers its own state, it still takes the states of the remaining agents into account. A classic illustration of the mean field game is the coordinated movement of a school of fish: an individual fish does not attend to the swimming behavior of every other fish in the school, but adjusts its own behavior according to the behavior of the fish in its neighborhood. Mean field game theory describes the behavioral response of surrounding agents and the behavior set of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The mean field multi-agent reinforcement learning (MFMARL) algorithm based on mean field game theory assumes that the influence of all other agents on a given agent can be represented by a mean distribution. MFMARL is suitable for reinforcement learning problems with large numbers of agents and simplifies the interaction computation among agents; it resolves the expansion of the value function space caused by the growth in the number of agents. MFMARL is therefore introduced into the shared bicycle dispatching framework here, and each agent is defined to have the same discrete action space.
Fourthly, the sensitivity of the reinforcement learning algorithm to the change of the state and the action information is improved in the framework:
In order to improve learning stability in the multi-agent deep reinforcement learning framework and to increase the sensitivity of the neural networks to changes in state and action information, the framework uses one-hot encoding as the neural network input and processes the reward function value with the hyperbolic tangent tanh(·) function after scaling it.
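A one-line Python sketch of this reward shaping, assuming the scaling coefficient alpha_rw defined earlier:

import math

def shaped_reward(raw_increase: float, alpha_rw: float) -> float:
    """Scale the increased trip amount, then squash it with tanh to keep rewards bounded."""
    return math.tanh(alpha_rw * raw_increase)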
Fifth, the framework improves the efficient collaborative ability between agents:
in order to improve the efficient cooperative ability among the agents, the framework designs that the agent i can learn the action contents of all agents in the learning process, namely, the states and strategies of other agents. In addition, different forms of reward functions in the framework are discussed herein, whose impact on collaborative ability is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a group-learned, mean-field-theory-based multi-agent reinforcement learning framework for shared bicycle dispatching.
The working principle and the process of the invention are as follows: the invention aims to establish a general frame of multi-agent deep reinforcement learning shared bicycle scheduling based on an average field theory so as to solve the problem of shared bicycle scheduling in a long-term scheduling process, a dynamic environment and a large-scale network. The method considers the stability of the state conversion of the multi-agent deep reinforcement learning algorithm, dimension explosion, the communication efficiency of the agents and the exploration behavior of the agents. A frame of a reinforcement learning algorithm is adopted, a coordinated and effective dynamic strategy is obtained in a high-dimensional action space, so that the travel requirement is met, and idle shared bicycles in a road are reduced. And defining the division of the area units by combining a reinforcement learning basic theory and the research of a shared bicycle dispatching system, and constructing a shared bicycle dispatching optimization model.
Aiming at the high-dimensional multi-agent action space, a shared bicycle dispatching framework based on mean field multi-agent deep reinforcement learning is provided. The proposed framework can be used to address long-term scheduling, dynamic environments, and large-scale, complex networks. The framework neither requires demand to be predicted in advance nor extra data processing, and is not affected by the computational efficiency or accuracy of demand prediction. Moreover, the framework does not seek the optimal policy for each time segment in isolation, but the overall optimization of the whole scheduling process, taking into account the supply-demand changes of future time segments and the influence of the current scheduling decision on the supply and demand of the next time segment.
The invention has the beneficial effects that:
(1) the shared bicycle dispatching optimization method based on reinforcement learning provided by the invention is beneficial to intelligently solving the short-term and long-term dispatching optimization problem of the shared bicycles of a large-scale road network under random and complex dynamic environments. The method does not need to predict the demand in advance or carry out manual data processing, and is not influenced by the calculation efficiency and accuracy of demand prediction. And the method is not an optimal strategy for each time segment, but is an overall optimization method of the whole scheduling process, which considers the supply and demand change of the future time segment and the influence of the scheduling decision on the supply and demand of the next time segment.
(2) The dynamic optimization scheduling policy provided by the invention improves scheduling operation efficiency. It increases the carrying capacity and utilization rate of shared bicycles and reduces the loss of shared bicycle user demand. It lowers the idle rate of shared bicycles on the roads and reduces the number of idle vehicles excessively accumulated in certain areas. It thereby reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large numbers of stacked idle vehicles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of bicycles in feeder traffic and improves the operating efficiency of the public transport system. Improving the service quality of shared bicycles encourages them to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (10)

1. A shared bicycle scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
s1: dividing a dispatching area of the shared bicycles to obtain dispatching area units, and determining running environment variables of the shared bicycles;
s2: determining a dispatching variable of the shared bicycle according to the running environment variable of the shared bicycle based on the dispatching area unit;
s3: constructing a bicycle dispatching optimization model of the shared bicycles according to the dispatching variables of the shared bicycles;
s4: the bicycle dispatching method comprises the steps of constructing a shared bicycle dispatching frame by utilizing an average field theory based on a vehicle dispatching optimization model of shared bicycles, and completing shared bicycle dispatching by utilizing the shared bicycle dispatching frame.
2. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S1 the specific method for dividing the dispatching area of the shared bicycles is as follows: the dispatching area of the shared bicycles is divided into a plurality of identical equilateral hexagons serving as dispatching area units, and for each dispatching area unit a global label variable η5, a horizontal direction label variable m and a vertical direction label variable h are defined, which satisfy a relation putting η5 in one-to-one correspondence with the pair (m, h), wherein η5 ∈ M′, M′ = {0, 1, ..., ((M+1)² − 1)}, M denotes the maximum value of the horizontal direction label variable or the vertical direction label variable of a dispatching area unit, and M′ denotes the unit label set of the dispatching area units;

in step S1, the operating environment variables of the shared bicycles include a time variable class and a city fixed-warehouse location set variable class;

the time variable class comprises a time step variable t, a time step variable set T and a maximum value variable Tmax of the time step, wherein t ∈ T and T = {0, 1, ..., Tmax};

the city fixed-warehouse location set variable class comprises the fixed warehouse location set η_w.
3. The deep reinforcement learning-based shared bicycle dispatching method according to claim 1, wherein in step S2 the dispatching variables of the shared bicycles comprise a policy execution state variable class, a supply-demand environment variable class, a riding trip variable class and a scheduling policy variable class;

the policy execution state variable class comprises the policy execution state variable tr, where tr ∈ {0,1};

at time step t, the supply-demand environment variable class comprises the shared bicycle travel demand variable d_{η5,t} of a dispatching area unit, the shared bicycle supply variable q^{tr=0}_{η5,t} of a dispatching area unit when the policy execution state variable tr is 0, and the shared bicycle supply variable q^{tr=1}_{η5,t} of a dispatching area unit when the policy execution state variable tr is 1;

at time step t, the riding trip variable class comprises the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip and the global label η3 of the dispatching area unit containing the OD destination of a shared bicycle trip, the OD label variable (η2, η3) of a shared bicycle trip and the OD flow of shared bicycle trips, the trip flow f_{η2,η3,t} of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip production variable P_{η5,t} of η5, and the actual shared bicycle attraction variable A_{η5,t} of η5;

at time step t, the scheduling policy variable class comprises the dispatching vehicle label set I, the dispatching vehicle label variable i, the starting unit label variable η_{i,0} of a dispatching vehicle, the arrival unit label variable η_{i,1} of a dispatching vehicle, the moving direction variable set κ1 of a dispatching vehicle, the dispatch ratio variable set κ2, the moving direction variable κ¹_{i,t} of a dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons, the dispatch ratio variable κ²_{i,t} of a dispatching vehicle, the scheduling policy a_{i,t} of a dispatching vehicle, the maximum cabin capacity C^max of a dispatching vehicle, the variable u_{i,t} of the number of shared bicycles a dispatching vehicle picks up from η_{i,0} and places at η_{i,1}, the ratio α_wh of the number of shared bicycles placed at η_{i,1} to the number of bicycles in the cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, the predicted cumulative increase/decrease Δ_{η5,t} of the supply of η5 before a dispatching vehicle implements its scheduling policy, under the assumption that the preceding dispatching vehicles have implemented theirs, the increased revenue r_{i,t} after a dispatching vehicle implements its scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;

wherein I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle label variable, i ∈ I, κ1 = {0, 1, ..., 5}, κ2 = {0, 0.25, 0.5, 0.75}, and a_{i,t} = (κ¹_{i,t}, κ²_{i,t}).
4. The method for dispatching bicycles based on deep reinforcement learning of claim 1, wherein in step S3 the vehicle dispatching optimization model of the shared bicycles is specifically the maximization of the objective function below subject to the constraints set out in the following paragraphs.

In the vehicle dispatching optimization model, the increased revenue r_{i,t} generated after the dispatching vehicles implement their scheduling policies is maximized as the objective function of the short-term scheduling optimization problem of shared bicycles, calculated as

max Σ_{t=0}^{Tmax} Σ_{i=0}^{N} r_{i,t},

wherein t denotes a time step, Tmax denotes the maximum value variable of the time step, i denotes the dispatching vehicle label variable, N denotes the maximum value of the dispatching vehicle label variable, and a_{i,t} denotes the scheduling policy of a dispatching vehicle;
when the policy execution state variable tr of time step variable t is 0, the action decision a_{i,t} is calculated as a_{i,t} = (κ¹_{i,t}, κ²_{i,t}), wherein κ¹_{i,t} denotes the moving direction variable of the dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons and κ²_{i,t} denotes the dispatch ratio variable of the dispatching vehicle;
when the policy execution state variable tr of time step variable t is 0 and the global label variable η5 of a dispatching area unit is the same as the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip, the shared bicycle path flow f_{η2,η3,t} for the OD label variable (η2, η3) of a shared bicycle trip is calculated, with the downward rounding operator INT(·), from the shared bicycle travel demand variable d_{η2,t} of the dispatching area unit, the shared bicycle supply variable of the dispatching area unit (the initially given supply when t = 0, and the supply q^{tr=1}_{η2,t−1} when the policy execution state variable tr equals 1 thereafter) and the trip rate φ_{η2,η3,t} of shared bicycles departing from η2 and arriving at η3, with η2, η3 ∈ M′, wherein M′ denotes the unit label set of the dispatching area units;

the trip rates departing from the global label η2 of the dispatching area unit containing the OD origin of a shared bicycle trip sum to 1, i.e. Σ_{η3∈M′} φ_{η2,η3,t} = 1, t ∈ T, wherein T denotes the set of time step variables and η3 denotes the global label of the unit containing the OD destination of a shared bicycle trip;
according to the path flow f_{η2,η3,t}, when the policy execution state tr is 0 at time step t and the global label variable η5 of a dispatching area unit is the same as the global label η2 of the unit containing the OD origin of a shared bicycle trip, the sum of the shared bicycle path flows is taken as the actual shared bicycle trip production P_{η5,t} of the dispatching area unit with global label variable η5, calculated as P_{η5,t} = Σ_{η3∈M′} f_{η5,η3,t};

when the policy execution state variable tr of time step variable t is 0 and the global label variable η5 of a dispatching area unit is the same as the global label η3 of the unit containing the OD destination of a shared bicycle trip, the sum of the shared bicycle path flows is taken as the actual shared bicycle attraction A_{η5,t} of the dispatching area unit with global label variable η5, calculated as A_{η5,t} = Σ_{η2∈M′} f_{η2,η5,t};
when the policy execution state variable tr of time step variable t is 0, the shared bicycle supply q^{tr=0}_{η5,t} is updated according to the number of shared bicycles rented and parked during the riders' trip activities, calculated as

q^{tr=0}_{η5,t} = q^{tr=1}_{η5,t−1} − P_{η5,t} + A_{η5,t},

wherein q^{tr=1}_{η5,t−1} denotes the shared bicycle supply variable with the policy execution state variable tr equal to 1 after the scheduling policy has been implemented at time step (t − 1), P_{η5,t} denotes the actual shared bicycle trip production variable of η5 at time step t, and A_{η5,t} denotes the actual shared bicycle attraction variable of η5 at time step t;
when the policy execution state variable tr of time step variable t is 0, the label variable η_{i,1} of the unit the dispatching vehicle will reach at time step (t + 1) is calculated from the horizontal direction label variable m and the vertical direction label variable h of the dispatching area unit, the starting unit label variable η_{i,0} of the dispatching vehicle at time step (t + 1), and the moving direction variable κ¹_{i,t} of the dispatching vehicle from η_{i,0} to one of the six adjacent regular hexagons;
when the policy execution state variable tr of time step variable t is 0, the predicted cumulative increase/decrease Δ_{η5,t} of the supply of η5 is calculated by accumulating, over the preceding dispatching vehicles, the number of shared bicycles each of them is predicted to pick up from or place at η5, wherein the number of shared bicycles the (i − 1)-th dispatching vehicle is predicted to pick up from η5 is counted negatively, α_wh denotes the ratio of the number of shared bicycles placed at η_{i,1} to the number of bicycles in the cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to η_w, and η_w denotes the fixed warehouse location set;
when the policy execution state variable tr is 0 at time step t, the dispatching vehicle picks up (formula FDA0003142305900000061) shared bicycles from ηi,0 into its compartment and will place all (formula FDA0003142305900000062) shared bicycles onto ηi,1; the number of bicycles picked up by the dispatching vehicle (formula FDA0003142305900000063) is calculated by formula FDA0003142305900000064 together with formula FDA0003142305900000065, where min(·) denotes taking the minimum value, formula FDA0003142305900000066 denotes the supply amount when the policy execution state variable tr is 0, ηi,0 denotes the starting unit label variable of the dispatching vehicle, formula FDA0003142305900000067 denotes the maximum capacity of the compartment of the dispatching vehicle, and formula FDA0003142305900000068 denotes the dispatch ratio variable of the dispatching vehicle;
when the policy execution state variable tr is 1 at time step t, the scheduling policy is executed according to the number of bicycles picked up by the dispatching vehicle (formula FDA0003142305900000069), and η5 is updated to obtain the shared bicycle supply variable of η5 after the scheduling policy is implemented (formula FDA00031423059000000610), calculated by formula FDA00031423059000000611; the total number of shared bicycles Zwarehouse stored in the fixed warehouses of the city is calculated by formula FDA00031423059000000612.
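The environment-update rules above can be summarized in a short illustrative sketch (Python). The array values, the linear supply balance and the pickup rule min(supply × dispatch ratio, compartment capacity) are simplified assumptions used only for illustration; the exact relations are those given by the referenced formulas.

```python
import numpy as np

def update_environment(supply_prev, generation, attraction):
    # supply_prev: supply per unit after the previous step's scheduling (tr = 1)
    # generation:  trips starting in each unit at this step (bicycles rented)
    # attraction:  trips ending in each unit at this step (bicycles parked)
    supply = supply_prev - generation + attraction
    return np.maximum(supply, 0.0)  # supply cannot become negative

def pickup_amount(unit_supply, compartment_capacity, dispatch_ratio):
    # Bicycles a dispatching vehicle picks up from its starting unit:
    # bounded by a share of the unit's supply and by the compartment capacity.
    return int(min(unit_supply * dispatch_ratio, compartment_capacity))

if __name__ == "__main__":
    supply_prev = np.array([30.0, 12.0, 5.0])   # hypothetical supply per unit
    generation = np.array([8.0, 2.0, 6.0])      # trip generation per unit
    attraction = np.array([3.0, 9.0, 1.0])      # trip attraction per unit
    supply = update_environment(supply_prev, generation, attraction)
    print(supply)                                # [25. 19.  0.]
    print(pickup_amount(supply[0], compartment_capacity=20, dispatch_ratio=0.5))  # 12
```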
5. The deep reinforcement learning-based shared bicycle scheduling method according to claim 1, wherein the step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycle;
S42: determining the average action using one-hot encoding;
S43: defining the experience pool variables and the training round related variables of the shared bicycle scheduling framework;
S44: constructing the shared bicycle scheduling framework, based on mean field theory, from the elements of the framework, the average action, the experience pool variables and the training round related variables.
6. The deep reinforcement learning-based shared bicycle scheduling method of claim 5, wherein in the step S41, the elements of the shared bicycle scheduling framework comprise the state (formula FDA00031423059000000613), the behavior parameter a_t and the reward function, where formula FDA0003142305900000071 represents the state of the dispatching vehicle at time step variable t and formula FDA0003142305900000072 represents the scheduling policy of the dispatching vehicle at time step variable t;
the reward function comprises the actually increased trip reward function of the dispatching vehicle (formula FDA0003142305900000073), the average increased trip reward function of the dispatching vehicle (formula FDA0003142305900000074) and the overall increased trip reward function of the dispatching vehicle (formula FDA0003142305900000075), given by formulas FDA0003142305900000076, FDA0003142305900000077 and FDA0003142305900000078 respectively,
where αrw represents the scaling coefficient of the reward function, formula FDA0003142305900000079 denotes the actual trip generation amount of (formula FDA00031423059000000710) when the scheduling policy is implemented, formula FDA00031423059000000711 denotes the actual trip generation amount of (formula FDA00031423059000000712) when the scheduling policy is not implemented, formula FDA00031423059000000713 denotes the actual trip generation amount of (formula FDA00031423059000000714) when the scheduling policy is implemented, formula FDA00031423059000000715 denotes the actual trip generation amount of (formula FDA00031423059000000716) when the scheduling policy is not implemented, formula FDA00031423059000000717 denotes the number of dispatching vehicles within (formula FDA00031423059000000718) at time step variable t, formula FDA00031423059000000719 denotes the number of dispatching vehicles within (formula FDA00031423059000000720) at time step variable t, formula FDA00031423059000000721 denotes the actual trip generation amount of η5 when the scheduling policy is implemented, formula FDA00031423059000000722 denotes the actual trip generation amount of η5 when the scheduling policy is not implemented, N represents the maximum value of the dispatching vehicle label variable, η5 represents the global label of each scheduling region unit, and M' represents the unit label set of the scheduling region units.
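As a rough illustration of how these three reward terms might be computed, the following sketch (Python) compares trip amounts with and without the scheduling policy. The scaling coefficient value and the region groupings are hypothetical, and the exact reward expressions are those of the referenced formulas.

```python
import numpy as np

ALPHA_RW = 0.1  # hypothetical value of the reward scaling coefficient alpha_rw

def actual_increase_reward(trips_with_policy, trips_without_policy):
    # Extra trips actually served in the unit the vehicle acted on.
    return ALPHA_RW * (trips_with_policy - trips_without_policy)

def average_increase_reward(region_trips_with, region_trips_without, n_vehicles_in_region):
    # Extra trips in the surrounding region, averaged over the vehicles acting there.
    gain = np.sum(region_trips_with) - np.sum(region_trips_without)
    return ALPHA_RW * gain / max(n_vehicles_in_region, 1)

def overall_increase_reward(all_trips_with, all_trips_without, n_vehicles):
    # Extra trips over every scheduling area unit, shared among all N vehicles.
    gain = np.sum(all_trips_with) - np.sum(all_trips_without)
    return ALPHA_RW * gain / max(n_vehicles, 1)
```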
7. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in the step S42, the specific method for determining the average action is as follows: the scheduling policy of the dispatching vehicle (formula FDA00031423059000000723) is rewritten using one-hot encoding to obtain the average action (formula FDA00031423059000000724), calculated by formulas FDA00031423059000000725 and FDA00031423059000000726 respectively, where formula FDA00031423059000000727 denotes a variable taking the value 0 or 1, pdim represents the dimension of the scheduling policy, N represents the maximum value of the dispatching vehicle label variable, formula FDA00031423059000000728 represents the action policy of ine, and ine denotes the label variable of the dispatching vehicles other than dispatching vehicle i.
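A minimal sketch of this one-hot mean-field encoding (Python), assuming a discrete action space whose size p_dim covers staying in place plus the six hexagonal moving directions; the dimension and the example actions are illustrative only.

```python
import numpy as np

def one_hot(action_index, p_dim):
    # Rewrite a discrete scheduling action as a one-hot vector of length p_dim.
    v = np.zeros(p_dim)
    v[action_index] = 1.0
    return v

def mean_action(other_actions, p_dim):
    # Mean-field average action: the mean of the one-hot encoded actions
    # of every dispatching vehicle other than vehicle i.
    if len(other_actions) == 0:
        return np.zeros(p_dim)
    return np.mean([one_hot(a, p_dim) for a in other_actions], axis=0)

# e.g. three other vehicles, action space of size 7 (stay + six hexagon directions)
print(mean_action([0, 3, 3], p_dim=7))
```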
8. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein in the step S43, the experience pool variables of the shared bicycle scheduling framework comprise the experience pool (formula FDA00031423059000000729) and the experience pool capacity (formula FDA00031423059000000730); the training round related variables comprise the number of training rounds Episode, the target network update interval Episodeupnet (in training rounds), the target network update weight coefficient ω and the cumulative return discount factor γ.
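A minimal experience-pool sketch (Python); the transition tuple layout is an assumption based on the quantities named in claims 9 and 10, not a structure fixed by the patent text.

```python
import random
from collections import deque

class ExperiencePool:
    """Minimal replay buffer corresponding to the experience pool of claim 8."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # experience pool capacity

    def store(self, transition):
        # transition layout (assumed): (state, action, mean_action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # random mini-batch for updating the value estimation network
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```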
9. The deep reinforcement learning-based shared bicycle scheduling method according to claim 5, wherein the step S44 comprises the following sub-steps:
S441: initializing the experience pool (formula FDA0003142305900000081), and setting the experience pool capacity (formula FDA0003142305900000082), the target network update weight coefficient ω, the reward function scaling coefficient αrw, the cumulative return discount factor γ, the initial given supply amount (formula FDA0003142305900000083), the shared bicycle travel demand variable (formula FDA0003142305900000084) and the trip flow of shared bicycles departing from η2 and arriving at η3 (formula FDA0003142305900000085); and cyclically performing steps S442 to S445 according to the number of training rounds Episodeupnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr is 0;
S443: updating the state of each dispatching vehicle (formula FDA0003142305900000086) and its scheduling policy (formula FDA0003142305900000087);
S444: updating the shared bicycle operating environment when the policy execution state variable tr is 1;
S445: updating the state of each dispatching vehicle at the next time step (formula FDA0003142305900000088) and the average action (formula FDA0003142305900000089), and updating, according to the reward function, the increased revenue of the dispatching vehicle after the scheduling policy is implemented (formula FDA00031423059000000810);
S446: based on the update process of steps S442 to S445, constructing the shared bicycle scheduling framework with a reinforcement learning algorithm, and completing shared bicycle scheduling with the shared bicycle scheduling framework.
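The loop S441 to S446 can be outlined as the following training skeleton (Python). The env, agent and pool interfaces (reset, step_environment, apply_scheduling, observe, act, learn, soft_update_targets) are hypothetical names introduced only to show the control flow; they are not defined in the patent text.

```python
def train(env, agents, pool, episodes, episode_upnet, gamma, omega):
    # Skeleton of steps S442-S445 plus the learning step of S446.
    for episode in range(episodes):
        states = env.reset()                               # S441: initial supply, demand, trip flows
        for t in range(env.horizon):
            env.step_environment(tr=0)                     # S442: rider trips update the environment
            actions = [agent.act(s) for agent, s in zip(agents, states)]  # S443
            env.apply_scheduling(actions)                  # S444: tr = 1, scheduling policies executed
            next_states, rewards, mean_actions = env.observe()            # S445
            for i, agent in enumerate(agents):
                pool.store((states[i], actions[i], mean_actions[i],
                            rewards[i], next_states[i]))
            states = next_states
        for agent in agents:                               # S446: update estimation networks
            agent.learn(pool, gamma)
        if episode % episode_upnet == 0:                   # periodic target-network transfer
            for agent in agents:
                agent.soft_update_targets(omega)
```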
10. The deep reinforcement learning-based shared bicycle scheduling method of claim 5, wherein in the step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 0 is as follows: updating and calculating the OD label variables (η2, η3) of shared bicycle trips, the shared bicycle path flow (formula FDA00031423059000000811), the actual trip generation amount of shared bicycles in η5 (formula FDA00031423059000000812), the actual attraction amount of shared bicycles in η5 (formula FDA00031423059000000813) and the supply amount when the policy execution state variable tr is 0 (formula FDA00031423059000000817);
in the step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr is 1 is as follows: updating and calculating the behavior parameter a_t, the variable for the number of shared bicycles picked up by the dispatching vehicle from ηi,0 and placed on arrival at ηi,1 (formula FDA00031423059000000815), and the supply amount when the policy execution state variable tr is 1 (formula FDA00031423059000000816);
in the step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
each dispatching vehicle comprises a policy model and a value model; in the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network; the policy estimation network is constructed as a neural network whose parameters are θi, whose input is the state of the dispatching vehicle (formula FDA0003142305900000091) and whose output is the scheduling policy (formula FDA0003142305900000092); the policy target network is constructed as a neural network whose parameters are (formula FDA0003142305900000093), whose input is the state of the dispatching vehicle at the next time step variable (formula FDA0003142305900000094) and whose output is the scheduling policy of the next time step variable (formula FDA0003142305900000095);
in both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network; the value estimation network is constructed as a neural network whose parameters are (formula FDA0003142305900000096), whose inputs are the state of the dispatching vehicle (formula FDA0003142305900000097), the scheduling policy (formula FDA0003142305900000098) and the average action (formula FDA0003142305900000099), and whose output is the Q-value function Qi, where the Q-value function refers to the state-action value function in the reinforcement learning algorithm and represents the cumulative reward value obtained by the dispatching vehicle; the value target network is constructed as a neural network whose parameters are (formula FDA00031423059000000910), whose inputs are the state of the dispatching vehicle at the next time step variable (formula FDA00031423059000000911), the scheduling policy of the next time step variable (formula FDA00031423059000000912) and the average action of the next time step variable (formula FDA00031423059000000913), and whose output is the target Q-value function (formula FDA00031423059000000914);
in the Q-Learning method, the policy model of the dispatching vehicle obtains the action (formula FDA00031423059000000917) by probability sampling and selection according to formulas FDA00031423059000000915 and FDA00031423059000000916, where formula FDA00031423059000000918 denotes the average action of ine at time step t-1, ine denotes the label variable of the dispatching vehicles other than dispatching vehicle i, ωd represents the policy parameters, formula FDA00031423059000000919 is the functional form of Qi, formula FDA00031423059000000920 represents the action probability calculation function, and Ai represents the action space set of (formula FDA00031423059000000921); based on the action space set (formula FDA00031423059000000922), (formula FDA00031423059000000923) is updated and substituted into formula FDA00031423059000000924, (formula FDA00031423059000000926) is obtained by probability sampling according to (formula FDA00031423059000000925), and (formula FDA00031423059000000927) is taken as the final action selected by the policy model of the dispatching vehicle;
if the reinforcement learning algorithm adopts the policy gradient method, (formula FDA00031423059000000928) is stored into the experience pool (formula FDA00031423059000000929), a batch of samples (formula FDA00031423059000000931) is randomly sampled from the experience pool (formula FDA00031423059000000930), the neural network parameters of the value estimation network (formula FDA00031423059000000932) are updated according to the samples (formula FDA00031423059000000937), the cumulative return discount factor γ and the loss function, and the neural network parameters θi of the policy model are updated by gradient descent; every Episodeupnet training rounds, the policy model neural network parameters θi and the value model neural network parameters (formula FDA00031423059000000933) are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network (formula FDA00031423059000000934) and of the value target network (formula FDA00031423059000000935) respectively, where formula FDA00031423059000000936 denotes the global state, s_{t+1} denotes the global state of the next time step, r_t^i denotes the reward value of the dispatching vehicle, formula FDA0003142305900000101 denotes the average action, s_{t,j} denotes the global state in the sample, s_{t+1,j} denotes the global state of the next time step in the sample, formula FDA0003142305900000102 denotes the policy in the sample, formula FDA0003142305900000103 denotes the average action in the sample, and formula FDA0003142305900000104 denotes the reward value of the dispatching vehicle in the sample;
if the reinforcement learning algorithm adopts the Q-Learning method, (formula FDA0003142305900000105) is stored into the experience pool (formula FDA0003142305900000106), a batch of samples (formula FDA0003142305900000108) is again randomly sampled from the experience pool (formula FDA0003142305900000107), the neural network parameters of the value estimation network (formula FDA00031423059000001010) are updated according to the samples (formula FDA0003142305900000109), the cumulative return discount factor γ and the loss function, and every Episodeupnet training rounds the value model neural network parameters (formula FDA00031423059000001011) are transferred, according to the target network update weight coefficient ω, into the neural network parameters of the value target network (formula FDA00031423059000001012).
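Two of the mechanisms named above, the weighted transfer of estimation-network parameters into target-network parameters and the probability sampling of actions in the Q-Learning policy model, are sketched below (Python). The softmax temperature and the blending form ω·online + (1-ω)·target are common conventions assumed here for illustration; the patent's exact update and sampling formulas are those in the referenced figures.

```python
import numpy as np

def transfer_to_target(target_params, online_params, omega):
    # Blend estimation-network weights into target-network weights with coefficient omega.
    # Parameters are represented as dicts of NumPy arrays (an assumption).
    return {name: omega * online_params[name] + (1.0 - omega) * target_params[name]
            for name in target_params}

def sample_action(q_values, temperature=1.0):
    # Probability sampling over the action space, with probabilities
    # proportional to exp(Q / temperature) (softmax; assumed form).
    q = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(q - q.max())
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

# e.g. seven Q-values for the seven candidate moves of one dispatching vehicle
print(sample_action([0.2, 1.5, -0.3, 0.0, 0.9, 0.4, -1.0]))
```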
CN202110744265.2A 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning Active CN113326993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110421814 2021-04-20
CN2021104218142 2021-04-20

Publications (2)

Publication Number Publication Date
CN113326993A true CN113326993A (en) 2021-08-31
CN113326993B CN113326993B (en) 2023-06-09

Family

ID=77425362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110744265.2A Active CN113326993B (en) 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113326993B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN112068515A (en) * 2020-08-27 2020-12-11 宁波工程学院 Full-automatic parking lot scheduling method based on deep reinforcement learning
CN112417753A (en) * 2020-11-04 2021-02-26 中国科学技术大学 Urban public transport resource joint scheduling method
CN112348258A (en) * 2020-11-09 2021-02-09 合肥工业大学 Shared bicycle predictive scheduling method based on deep Q network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
IBRAHIM ALTHAMARY et al.: "A Survey on Multi-Agent Reinforcement Learning Methods for Vehicular Networks", 2019 15TH INTERNATIONAL WIRELESS COMMUNICATIONS & MOBILE COMPUTING CONFERENCE (IWCMC) *
VAN HASSELT et al.: "Deep Reinforcement Learning with Double Q-Learning", 30TH ASSOCIATION-FOR-THE-ADVANCEMENT-OF-ARTIFICIAL-INTELLIGENCE (AAAI) CONFERENCE ON ARTIFICIAL INTELLIGENCE *
樊瑞娜: "共享单车系统的平均场理论与闭排队网络研究" [Mean-field theory and closed queueing network research for bike-sharing systems], 《中国博士学位论文全文数据库 经济与管理科学辑》 *
涂雯雯等: "A Deep Learning Model for Traffic Flow State Classification Based on Smart Phone Sensor Data", 《ARXIV PREPRINT ARXIV》 *
陈佳惠等: "共享单车调度路径优化研究" [Research on route optimization of shared bicycle dispatching], 《交通科技与经济》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113997926A (en) * 2021-11-30 2022-02-01 江苏浩峰汽车附件有限公司 Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
CN115796399A (en) * 2023-02-06 2023-03-14 佰聆数据股份有限公司 Intelligent scheduling method, device and equipment based on electric power materials and storage medium
CN116307251A (en) * 2023-04-12 2023-06-23 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116307251B (en) * 2023-04-12 2023-09-19 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116402323A (en) * 2023-06-09 2023-07-07 华东交通大学 Taxi scheduling method
CN116402323B (en) * 2023-06-09 2023-09-01 华东交通大学 Taxi scheduling method
CN116824861A (en) * 2023-08-24 2023-09-29 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform
CN116824861B (en) * 2023-08-24 2023-12-05 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform

Also Published As

Publication number Publication date
CN113326993B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN113326993A (en) Shared bicycle scheduling method based on deep reinforcement learning
CN111862579B (en) Taxi scheduling method and system based on deep reinforcement learning
CN108417031B (en) Intelligent parking berth reservation strategy optimization method based on Agent simulation
CN113222463B (en) Data-driven neural network agent-assisted strip mine unmanned truck scheduling method
CN112738752A (en) WRSN multi-mobile charger optimized scheduling method based on reinforcement learning
CN116227773A (en) Distribution path optimization method based on ant colony algorithm
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
Xu et al. Designing van-based mobile battery swapping and rebalancing services for dockless ebike-sharing systems based on the dueling double deep Q-network
CN104537446A (en) Bilevel vehicle routing optimization method with fuzzy random time window
Kiaee Integration of electric vehicles in smart grid using deep reinforcement learning
CN117350424A (en) Economic dispatching and electric vehicle charging strategy combined optimization method in energy internet
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
CN117541026B (en) Intelligent logistics transport vehicle dispatching method and system
CN112750298B (en) Truck formation dynamic resource allocation method based on SMDP and DRL
CN114117910A (en) Electric vehicle charging guide strategy method based on layered deep reinforcement learning
CN117592701A (en) Scenic spot intelligent parking lot management method and system
Xu et al. Research on open-pit mine vehicle scheduling problem with approximate dynamic programming
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN117032298A (en) Unmanned aerial vehicle task allocation planning method under synchronous operation and cooperative distribution mode of truck unmanned aerial vehicle
CN115907066A (en) Cement enterprise vehicle scheduling method based on hybrid sparrow intelligent optimization algorithm
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle
CN114611864A (en) Garbage vehicle low-carbon scheduling method and system
Dziubany et al. Optimization of a cpss-based flexible transportation system
CN112561104A (en) Vehicle sharing service order dispatching method and system based on reinforcement learning
CN111652550A (en) Method, system and equipment for intelligently searching optimal loop set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant