CN113326993B - Shared bicycle scheduling method based on deep reinforcement learning - Google Patents

Shared bicycle scheduling method based on deep reinforcement learning

Info

Publication number
CN113326993B
CN113326993B (application number CN202110744265.2A; application publication CN113326993A)
Authority
CN
China
Prior art keywords
variable
scheduling
shared bicycle
eta
dispatching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110744265.2A
Other languages
Chinese (zh)
Other versions
CN113326993A (en)
Inventor
肖峰
涂雯雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwestern University Of Finance And Economics
Original Assignee
Southwestern University Of Finance And Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwestern University Of Finance And Economics filed Critical Southwestern University Of Finance And Economics
Publication of CN113326993A publication Critical patent/CN113326993A/en
Application granted granted Critical
Publication of CN113326993B publication Critical patent/CN113326993B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40Business processes related to the transportation industry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/04Constraint-based CAD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/08Probabilistic or stochastic CAD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/12Timing analysis or timing optimisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Operations Research (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)

Abstract

The invention discloses a shared bicycle scheduling method based on deep reinforcement learning, which comprises the following steps: S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units and determining the operating environment variables of the shared bicycles; S2: determining the scheduling variables of the shared bicycles; S3: constructing a vehicle scheduling optimization model for the shared bicycles; S4: based on the vehicle scheduling optimization model, constructing a shared bicycle scheduling framework using mean field theory and completing shared bicycle scheduling with that framework. This reinforcement-learning-based scheduling optimization method helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under random and complex dynamic environments. The method accounts for future changes in supply and demand and for the interaction between scheduling decisions and the environment, requires neither advance demand prediction nor manual data processing, and is therefore not limited by the computational efficiency and accuracy of demand forecasting.

Description

Shared bicycle scheduling method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of vehicle scheduling, and particularly relates to a shared bicycle scheduling method based on deep reinforcement learning.
Background
In the prior art, the bicycle scheduling optimization problem is usually solved by dividing the scheduling horizon into separate time periods and searching for the optimal scheduling strategy independently within each period. However, the scheduling policy of one period affects the supply and demand environment of the following periods. Such period-by-period, isolated policy optimization ignores both the supply and demand conditions of future periods and the influence of the policies already implemented. Under this approach, the strategy that is optimal within a single period does not necessarily lead to a higher actual trip volume later on, and may even reduce it. Consequently, isolated period-based policy optimization does not necessarily yield the globally optimal strategy over the full scheduling horizon.
Disclosure of Invention
The invention aims to solve the shared bicycle scheduling problem over a long scheduling horizon, in a dynamic environment, and on a large-scale network, and provides a shared bicycle scheduling method based on deep reinforcement learning.
The technical scheme of the invention is as follows. The shared bicycle scheduling method based on deep reinforcement learning comprises the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to the scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
Further, in step S1, the specific method for dividing the scheduling area of the shared bicycles is as follows: the scheduling area is divided into a number of equilateral hexagons taken as scheduling area units, and for each scheduling area unit a global tag variable η_5, a horizontal direction tag variable m, and a vertical direction tag variable h are defined, which satisfy a fixed relation (given as a formula image in the original publication);
where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² - 1}, M denotes the maximum value of the horizontal or vertical direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
In step S1, the operating environment variables of the shared bicycles comprise time variables and a city fixed warehouse location set variable;
the time variables comprise the time step variable t, the time step variable set T, and the maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max};
the city fixed warehouse location set variable comprises the fixed warehouse location set η_w.
Further, in step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class, and a scheduling policy variable class;
the policy execution state variable class includes the policy execution state variable tr, where tr ∈ {0, 1};
at time step t, the supply and demand environment variable class includes the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 1 (the symbols for these variables are given as formula images in the original publication);
at time step t, the riding trip variable class comprises the global tag η_2 of the scheduling area unit where the OD starting point of a shared bicycle trip is located, the global tag η_3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η_2, η_3) of a shared bicycle trip, the OD flow of shared bicycle trips, the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, the actual travel variable of the shared bicycles generated in η_5, and the actual attraction variable of the shared bicycles in η_5;
at time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable η_i,0 of the dispatching vehicle, the arrival unit tag variable η_i,1 of the dispatching vehicle, the movement direction variable set κ_1 of the dispatching vehicle, the scheduling ratio variable set κ_2, the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons, the scheduling ratio variable of the dispatching vehicle, the scheduling policy of the dispatching vehicle, the maximum capacity of the cabin of the dispatching vehicle, the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, the ratio α_wh of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, the estimated cumulative increase variable of η_5 under the assumption that the scheduling policies of the preceding dispatching vehicles have already been applied before dispatching vehicle i applies its own policy, the benefit increased after the dispatching vehicle implements the scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouse at the end of the scheduling period;
where I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}.
Further, step S4 comprises the following sub-steps:
S41: determining the elements of the shared bicycle scheduling framework based on the vehicle scheduling optimization model of the shared bicycles;
S42: determining the average action using one-hot encoding;
S43: defining the experience pool variables and the training-round-related variables of the shared bicycle scheduling framework;
S44: based on mean field theory, constructing the shared bicycle scheduling framework from the elements, the average action, the experience pool variables, and the training-round-related variables.
Further, in step S41, the elements of the shared bicycle scheduling framework include the state, the behavior parameter a_t, and the reward function, where the state denotes the state of the dispatching vehicle at time step t and a_t denotes the scheduling policy of the dispatching vehicle at time step t.
The reward function comprises the reward function for the actually increased trip amount of the dispatching vehicle, the reward function for the average increased trip amount of the dispatching vehicle, and the global increased trip function of the dispatching vehicles (the specific formulas are given as images in the original publication);
where α_rw denotes the scaling factor of the reward function; the remaining symbols denote the actual travel amounts of the units η_i,1 and η_i,0 with and without the scheduling policy, the numbers of dispatching vehicles located within η_i,1 and within η_i,0 at time step t, and the actual travel amount of unit η_5 when the scheduling policy is and is not implemented; N denotes the maximum value of the dispatching vehicle tag variable, η_5 denotes the global tag of each scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
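The reward formulas themselves appear only as images in the original publication. The following Python sketch is purely illustrative and rests on assumptions: that the actual-increase reward is the α_rw-scaled difference between a unit's trips with and without the scheduling policy, that the average-increase reward divides that difference by the number of dispatching vehicles in the unit, and that the global-increase reward sums the differences over all units. The function and variable names are hypothetical.

def actual_increase_reward(alpha_rw, trips_with_policy, trips_without_policy, unit):
    """Reward from the actual increase in trips of the unit the vehicle serves (assumed form)."""
    return alpha_rw * (trips_with_policy[unit] - trips_without_policy[unit])

def average_increase_reward(alpha_rw, trips_with_policy, trips_without_policy,
                            unit, vehicles_in_unit):
    """Actual-increase reward shared among the dispatching vehicles in the unit (assumed form)."""
    return (actual_increase_reward(alpha_rw, trips_with_policy,
                                   trips_without_policy, unit)
            / max(vehicles_in_unit[unit], 1))

def global_increase_reward(alpha_rw, trips_with_policy, trips_without_policy, units):
    """Total increase in trips over all scheduling area units (assumed form)."""
    return alpha_rw * sum(trips_with_policy[u] - trips_without_policy[u] for u in units)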
Further, in step S42, the specific method for determining the average action is as follows: the scheduling policy of each dispatching vehicle is rewritten using one-hot encoding to obtain the average action (the calculation formulas are given as images in the original publication);
where the encoded entries are 0-1 variables, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle tag variable, the action policy of i_ne is encoded in the same way, and i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i.
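As a concrete illustration of the one-hot rewriting described above, the sketch below encodes each neighbouring dispatching vehicle's discrete policy (a direction in κ_1 combined with a ratio index in κ_2, which would give ρ_dim = 24 joint actions) as a 0-1 vector and averages the vectors of the other vehicles. The encoding order and the value of ρ_dim are assumptions made for illustration only.

import numpy as np

N_DIRECTIONS = 6                    # kappa_1 = {0, ..., 5}
N_RATIOS = 4                        # kappa_2 = {0, 0.25, 0.5, 0.75}
RHO_DIM = N_DIRECTIONS * N_RATIOS   # assumed dimension of the one-hot policy

def one_hot_policy(direction: int, ratio_index: int) -> np.ndarray:
    """Rewrite a (direction, ratio) scheduling policy as a 0/1 vector."""
    vec = np.zeros(RHO_DIM)
    vec[direction * N_RATIOS + ratio_index] = 1.0
    return vec

def mean_action(other_policies) -> np.ndarray:
    """Average the one-hot policies of the other dispatching vehicles i_ne."""
    encoded = [one_hot_policy(d, r) for d, r in other_policies]
    return np.mean(encoded, axis=0) if encoded else np.zeros(RHO_DIM)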
Further, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round-related variables comprise the number of training rounds Episode, the target-network update interval Episode_upnet, the weight coefficient ω for the target network update, and the cumulative return discount factor γ.
Further, step S44 includes the following sub-steps:
S441: initializing the experience pool, and setting the experience pool capacity, the weight coefficient ω for the target network update, the reward function scaling factor α_rw, the cumulative return discount factor γ, the initially given supply, the shared bicycle travel demand variable, and the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3; steps S442-S445 are then performed in a loop according to the number of training rounds Episode_upnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling policy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state and the average action of each dispatching vehicle, and updating the benefit increased after the dispatching vehicle implements the scheduling policy according to the reward function;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm, and completing shared bicycle scheduling with the scheduling framework.
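For readability, sub-steps S441-S446 can be summarised by the training-loop skeleton below. Every function name (env.update, agent.act, agent.learn, and so on) is a placeholder standing in for the environment and network updates described in the text; this is a sketch of the loop structure, not the patent's actual implementation.

from collections import deque

def train(env, agents, episodes, episode_upnet, gamma, omega, pool_capacity):
    replay = deque(maxlen=pool_capacity)          # S441: experience pool
    for episode in range(episodes):
        env.reset()                               # initial supply, demand, flow ratios
        for t in env.time_steps:
            env.update(tr=0)                      # S442: rider trips, supply/demand update
            states = [env.state(i) for i in agents]
            actions = [a.act(s) for a, s in zip(agents, states)]       # S443
            env.apply(actions)                    # S444: implement scheduling (tr=1)
            next_states = [env.state(i) for i in agents]
            mean_acts = [env.mean_action(i, actions) for i in agents]
            rewards = [env.reward(i) for i in agents]                  # S445
            replay.append((states, actions, mean_acts, rewards, next_states))
            for agent in agents:                  # S446: reinforcement learning update
                agent.learn(replay, gamma)
        if episode % episode_upnet == 0:
            for agent in agents:
                agent.soft_update_targets(omega)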
Further, in step S442, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is as follows: updating and computing the shared bicycle path flow of the OD tag variable (η_2, η_3) of shared bicycle trips, the actual travel variable of the shared bicycles generated in η_5, the actual attraction variable of the shared bicycles in η_5, and the supply when the policy execution state variable tr = 0.
In step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is as follows: according to the behavior parameter a_t, updating and computing the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, and the supply when the policy execution state variable tr = 1.
In step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method.
Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state of the dispatching vehicle and whose output is its scheduling policy; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step and whose output is the scheduling policy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatching vehicle, its scheduling policy, and the average action, and whose output is the Q-value function Q_i, where the Q function refers to the state-action value function in the reinforcement learning algorithm and represents the cumulative reward attained by the dispatching vehicle. The value target network is a neural network with its own parameters whose inputs are the state, the scheduling policy, and the average action of the dispatching vehicle at the next time step, and whose output is the target Q-value function.
In the Q-Learning method, the policy model of the dispatching vehicle selects an action by probability sampling based on the Q-value function and the average action of the previous time step (the formulas are given as images in the original publication), where the average action of time step t-1 refers to the other dispatching vehicles i_ne, i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, ω_d denotes the policy parameters, Q_i appears in its functional expression form, an action-probability calculation function is used, and A_i denotes the action space set. The average action is then updated according to the sampled actions and substituted back into the probability formula, an action is re-sampled according to the resulting probabilities, and the re-sampled action is taken as the final choice of the policy model of the dispatching vehicle.
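The action-probability formula itself is shown only as an image. A common realisation of "probability sampling over the Q values and the previous mean action", as used in mean-field Q-learning, is a Boltzmann (softmax) policy; the sketch below adopts that form purely as an assumption, and the q_network interface and temperature parameter are hypothetical.

import numpy as np

def sample_action(q_network, state, prev_mean_action, temperature=1.0):
    """Boltzmann sampling over the action space A_i (assumed form of the
    action-probability function; the patent's exact formula is an image)."""
    q_values = np.array([q_network(state, a, prev_mean_action)
                         for a in range(q_network.n_actions)])
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

As described above, the sampled actions of the other vehicles would then be averaged into an updated mean action, the probabilities recomputed, and a second sample taken as the final action.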
If the reinforcement learning algorithm adopts the policy gradient method, each transition (global state, scheduling policies, average actions, rewards, next global state) is stored in the experience pool, and a batch of samples is randomly drawn from the experience pool; according to the sampled transitions, the cumulative return discount factor γ, and the loss function, the neural network parameters of the value estimation network are updated, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, the parameters θ_i of the policy-model neural network and the neural network parameters of the value model are transferred, according to the weight coefficient ω of the target network update, to the neural network parameters of the corresponding policy target network and value target network, respectively; here s_t denotes the global state, s_t+1 the global state of the next time step, s_t,j and s_t+1,j the global state and next-time-step global state of a sampled transition, and the remaining symbols (given as formula images in the original publication) denote the reward of the dispatching vehicle, the average action, and the policy, average action, and reward of a sampled transition.
If the reinforcement learning algorithm adopts the Q-Learning method, each transition is likewise stored in the experience pool and a batch of samples is randomly drawn from it; on the sampled transitions, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function, and every Episode_upnet training rounds the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the weight coefficient ω of the target network update.
The beneficial effects of the invention are as follows:
(1) The reinforcement-learning-based shared bicycle scheduling optimization method helps to solve, in an intelligent way, the short-term and long-term scheduling optimization problems of shared bicycles on a large-scale road network under random and complex dynamic environments. The method requires neither advance demand prediction nor manual data processing, and is therefore not limited by the computational efficiency and accuracy of demand forecasting. Moreover, instead of optimizing each time period in isolation, it is an overall optimization method for the entire scheduling process that takes into account the supply and demand changes of future periods and the influence of scheduling decisions on the supply and demand of subsequent periods.
(2) The dynamic optimized scheduling strategy provided by the invention improves scheduling efficiency: it increases the trip volume and utilization rate of shared bicycles, reduces the loss of user demand, lowers the idle rate of shared bicycles on the roads, and reduces the number of idle vehicles excessively accumulated in certain areas. This reduces the waste of shared resources and alleviates the deterioration of the urban environment caused by large piles of idle bicycles.
(3) Increasing the actual trip volume of shared bicycle users raises the share of shared bicycles in feeder traffic and improves the operating efficiency of the public transportation system. Better shared bicycle service quality encourages replacing motor vehicle trips with shared bicycle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Drawings
FIG. 1 is a flow chart of the shared bicycle scheduling method;
FIG. 2 is a diagram of area units based on equilateral hexagonal division.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Before describing particular embodiments of the present invention, in order to make the description clearer and more complete, the abbreviations and key terms appearing in the invention are first defined:
OD traffic: the traffic volume between an origin and a destination. "O" comes from the English word ORIGIN and refers to the departure place of a trip; "D" comes from the English word DESTINATION and refers to the destination of a trip.
MFMARL algorithm: Mean Field Multi-Agent Reinforcement Learning, a multi-agent reinforcement learning algorithm based on mean field game theory.
As shown in FIG. 1, the present invention provides a shared bicycle scheduling method based on deep reinforcement learning, comprising the following steps:
S1: dividing the scheduling area of the shared bicycles to obtain scheduling area units, and determining the operating environment variables of the shared bicycles;
S2: determining the scheduling variables of the shared bicycles according to the operating environment variables, based on the scheduling area units;
S3: constructing a vehicle scheduling optimization model of the shared bicycles according to the scheduling variables;
S4: based on the vehicle scheduling optimization model of the shared bicycles, constructing a shared bicycle scheduling framework using mean field theory, and completing shared bicycle scheduling with the scheduling framework.
In the embodiment of the invention, the sequential decision problem takes into account the interaction between the supply and demand environment and the implemented scheduling policy, and thereby addresses the dynamic scheduling optimization problem of shared bicycles. Depending on the length of the scheduling optimization period and on whether surplus shared bicycles may be placed into a city fixed warehouse, the scheduling optimization problem can be divided into two cases: the shared bicycle scheduling optimization problem without a fixed warehouse, and the one in which a fixed warehouse is considered.
In the shared bicycle scheduling optimization problem, the optimization target is neither to maximize the actual trip volume within a single time period nor to pursue the scheduling efficiency of a single dispatching vehicle, but to maximize the global trip volume over the whole scheduling period through cooperative dynamic scheduling policy optimization. Furthermore, once this goal is achieved, and when a city warehouse exists, the scheduling policy also includes the action of placing excess vehicles into the warehouse so as to reduce redundant bicycles on the roads.
The invention constructs the scheduling optimization process of the shared bicycles as shown in fig. 3. In the dynamic scheduling optimization process, the invention considers the renting, riding, and parking of bicycles, the scheduling process, and the changes in supply and demand. At each time step, each dispatching vehicle picks up a certain number of shared bicycles from the unit where it is currently located and loads them into its cabin; the dispatching vehicle then travels to its arrival unit and places all the shared bicycles in its cabin at that unit.
In the embodiment of the present invention, as shown in FIG. 2, the specific method for dividing the scheduling area is as follows: the scheduling area of the shared bicycles is divided into a number of equilateral hexagons taken as scheduling area units, and for each scheduling area unit a global tag variable η_5, a horizontal direction tag variable m, and a vertical direction tag variable h are defined, which satisfy a fixed relation (given as a formula image in the original publication), where η_5 ∈ M′, M′ = {0, 1, ..., (M+1)² - 1}, M denotes the maximum value of the horizontal or vertical direction tag variable of a scheduling area unit, and M′ denotes the unit tag set of the scheduling area units.
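The tag relation itself is given only as a formula image. One consistent reading, compatible with η_5 ∈ {0, ..., (M+1)² - 1}, is a row-major numbering η_5 = m + (M+1)·h with m, h ∈ {0, ..., M}; the sketch below uses that mapping purely as an assumption.

def global_tag(m: int, h: int, M: int) -> int:
    """Assumed row-major mapping from (horizontal, vertical) tags to the global tag."""
    assert 0 <= m <= M and 0 <= h <= M
    return m + (M + 1) * h          # eta_5 in {0, ..., (M+1)**2 - 1}

def horizontal_vertical(eta5: int, M: int) -> tuple:
    """Inverse mapping back to the horizontal and vertical direction tags."""
    return eta5 % (M + 1), eta5 // (M + 1)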
In step S1, the operating environment variables of the shared bicycles comprise time variables, a city fixed warehouse location set variable, and supply parameters.
The time variables comprise the time step variable t, the time step variable set T, and the maximum time step variable T_max, where t ∈ T and T = {0, 1, ..., T_max}.
The city fixed warehouse location set variable comprises the fixed warehouse location set η_w.
Some units in a city may be designated as city fixed warehouses. When scheduling measures are implemented, a dispatching vehicle may put idle shared bicycles into the city fixed warehouse of such a unit. The capacity of a city fixed warehouse has no upper limit, and bicycles delivered to the warehouse by dispatching vehicles are no longer taken out and given to riders. When the unit containing the destination of a rider's trip is a city fixed warehouse unit, the shared bicycle parked by the rider is not placed into the warehouse but remains in the unit and can still be used by riders in future time steps. The city fixed warehouse location set variable contains the unit locations of all city fixed warehouses within the area.
The supply parameters include a first supply coefficient c_dis and a second supply coefficient c_initial. The first supply coefficient c_dis is determined as follows: the demand value of each scheduling area unit in each time step is computed from the shared bicycle demand data, and the 40th percentile of the 10-minute demand values over all scheduling area units is taken as c_dis. The second supply coefficient c_initial is determined as the ratio of the shared bicycle supply within each scheduling area unit at the initial time to the first supply coefficient c_dis.
It is assumed that the shared bicycle supply of each unit is uniformly distributed over the area at the initial time. To keep the analysis of the influence of supply on trips general, the invention does not directly specify the supply quantity but determines the supply value from the relation between supply and demand. The 40th percentile is chosen rather than the mean because the mean is more easily affected by extreme values; the 40th percentile is the value at the 40% position after all values are arranged from small to large, which prevents the few very high demands in the riding demand sequence of all units from skewing the analysis.
The second supply parameter c_initial is the ratio of the shared bicycle supply of each unit at the initial time to the first supply coefficient c_dis. In the invention, c_initial is chosen from the set {20, 50, 100, 200, 500, 1000}. The shared bicycle supply of each unit at the initial time is therefore defined as c_dis times c_initial, rounded down to an integer.
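A minimal sketch of the two supply coefficients as described above: c_dis as the 40th percentile of the per-unit 10-minute demand values, and the initial per-unit supply as the rounded-down product c_dis times c_initial. The layout of the demand array is an assumption for illustration.

import numpy as np

def first_supply_coefficient(demand_per_unit_per_10min: np.ndarray) -> float:
    """c_dis: the 40th percentile of the 10-minute demand values of all units."""
    return float(np.percentile(demand_per_unit_per_10min, 40))

def initial_supply(c_dis: float, c_initial: int) -> int:
    """Initial shared bicycle supply of each unit, rounded down
    (c_initial is chosen from {20, 50, 100, 200, 500, 1000} in the patent)."""
    return int(c_dis * c_initial)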
In the embodiment of the invention, in step S2, the scheduling variables of the shared bicycles include the policy execution state variable class, the supply and demand environment variable class, the riding trip variable class, and the scheduling policy variable class.
The policy execution state variable class includes the policy execution state variable tr, where tr ∈ {0, 1}.
At time step t, the supply and demand environment variable class includes the shared bicycle travel demand variable of the scheduling area unit, the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 0 (representing the number of shared bicycles available), and the shared bicycle supply variable of the scheduling area unit when the policy execution state variable tr = 1 (also representing the number of shared bicycles available).
At time step t, the riding trip variable class comprises the global tag η_2 of the scheduling area unit where the OD starting point of a shared bicycle trip is located, the global tag η_3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η_2, η_3) of a shared bicycle trip, the OD flow of shared bicycle trips, the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, the actual travel variable of the shared bicycles generated in η_5, and the actual attraction variable of the shared bicycles in η_5.
η_2 and η_3 are converted to horizontal and vertical tags in the same way as the tag relation of the scheduling area unit. For a given starting unit η_2, the travel flow ratios over the destination units η_3 sum to 1. When η_2 = η_5, the shared bicycle path flows that take unit η_2 as starting unit sum to the actual travel amount; when η_3 = η_5, the shared bicycle path flows that take unit η_3 as destination unit sum to the actual attraction amount.
At time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable η_i,0 of the dispatching vehicle, the arrival unit tag variable η_i,1 of the dispatching vehicle, the movement direction variable set κ_1 of the dispatching vehicle, the scheduling ratio variable set κ_2, the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons, the scheduling ratio variable of the dispatching vehicle, the scheduling policy of the dispatching vehicle, the maximum capacity of the cabin of the dispatching vehicle, the variable for the number of shared bicycles the dispatching vehicle picks up at η_i,0 and places at η_i,1, the ratio α_wh of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, the estimated cumulative increase variable of η_5 under the assumption that the scheduling policies of the preceding dispatching vehicles have already been applied before dispatching vehicle i applies its own policy, the benefit increased after the dispatching vehicle implements the scheduling policy, and the total number Z_warehouse of shared bicycles stored in the city fixed warehouse at the end of the scheduling period.
Here I = {0, 1, ..., N}, N denotes the maximum value of the dispatching vehicle tag variable, i ∈ I, κ_1 = {0, 1, ..., 5}, and κ_2 = {0, 0.25, 0.5, 0.75}. When the movement direction variable takes the values 0 to 5, the dispatching vehicle moves to the adjacent unit to the lower left, to the right, to the upper left, to the left, to the lower right, and to the upper right, respectively; the corresponding relations between the movement direction and the unit tags are given as formula images in the original publication, where m denotes the horizontal direction tag variable of a scheduling area unit, h denotes the vertical direction tag variable of the scheduling area unit, M′ denotes the unit tag set of the scheduling area units, and T denotes the time step variable set.
η_i,0 and η_i,1 are converted to horizontal and vertical tags in the same way as the tag relation of the scheduling area unit. The scheduling ratio variable of dispatching vehicle i may take one of the four percentages in κ_2; it denotes the percentage of the current supply of the pick-up unit η_i,0 that dispatching vehicle i is to pick up. When the number of shared bicycles expected to be cumulatively removed from unit η_5 is greater than the number expected to be placed there, the estimated cumulative increase amount is negative; conversely, when the number expected to be cumulatively removed is smaller than or equal to the number expected to be placed, the estimated cumulative increase amount is non-negative.
In the embodiment of the present invention, in step S3, the vehicle scheduling optimization model of the shared bicycles consists of an objective function and its constraints (the model formulas are given as images in the original publication and are explained below).
In the vehicle scheduling optimization model, the benefit increased after the dispatching vehicles implement the scheduling policy is taken as the objective function, to be maximized, of the short-term scheduling optimization problem of the shared bicycles; it is computed by summing the increased benefits over all time steps and all dispatching vehicles (the formula is given as an image in the original publication), where t denotes the time step, T_max denotes the maximum time step variable, i denotes the dispatching vehicle tag variable, N denotes the total number of dispatching vehicle tag variables, and the remaining symbol denotes the scheduling policy of the dispatching vehicle.
The invention defines the benefit as the increase in shared bicycle trips compared with the case without any scheduling policy. The decision variable is the action decision of the dispatching vehicle, which includes the movement direction of the dispatching vehicle and the scheduling ratio. When the policy execution state variable tr = 0 at time step t, the decision variable is the action decision, composed of the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons and the scheduling ratio variable of the dispatching vehicle.
When the policy execution state variable tr = 0 at time step t and the global tag variable η_5 of the scheduling area unit equals the global tag η_2 of the scheduling area unit where the OD starting point of the shared bicycle trip is located, the shared bicycle path flow of the OD tag variable (η_2, η_3) is computed according to the formula given as an image in the original publication, where INT(·) denotes rounding down, the demand symbol denotes the shared bicycle travel demand variable of the scheduling area unit, the supply symbols denote the shared bicycle supply variable of the scheduling area unit at the initially given supply (t = 0) and the supply variable when the policy execution state variable tr = 1, the ratio symbol denotes the travel flow ratio of shared bicycles departing from η_2 and arriving at η_3, and M′ denotes the unit tag set of the scheduling area units.
The actual travel amount of the scheduling area unit is the smaller of the supply of the unit when the policy execution state tr = 0 and the generated demand of the unit. The path flow is then no greater than the rounded-down product of the actual travel amount of unit η_5 and the travel flow ratio with η_2 as origin and η_3 as destination.
The travel flow ratio of shared bicycles departing from η_2 and arriving at η_3 satisfies the conservation relation between path flow and OD flow: for the global tag η_2 of the scheduling area unit where the OD starting point is located taken as origin, the travel flow ratios sum to 1 (the formula is given as an image in the original publication), where T denotes the time step variable set and η_3 denotes the global tag of the unit where the OD destination of the shared bicycle trip is located.
According to the path flow, when the policy execution state tr = 0 at time step t and the global tag variable η_5 of the scheduling area unit equals the global tag η_2 of the unit where the OD starting point of the shared bicycle trip is located, the sum of the shared bicycle path flows is taken as the actual travel amount of the shared bicycles of the scheduling area unit η_5 (the formula is given as an image in the original publication).
When the policy execution state variable tr = 0 at time step t and the global tag variable η_5 equals the global tag η_3 of the unit where the OD destination of the shared bicycle trip is located, the sum of the shared bicycle path flows of the OD tag variable (η_2, η_3) is taken as the actual attraction amount of the shared bicycles of the scheduling area unit η_5 (the formula is given as an image in the original publication).
When the policy execution state variable tr = 0 at time step t, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' travel activities: the supply of unit η_5 when tr = 0 equals the supply when tr = 1 after the scheduling policy was applied at time step (t-1), minus the actual travel amount of unit η_5 at time step t, plus the actual attraction amount of unit η_5 at time step t.
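The tr = 0 update described in this and the preceding paragraphs can be summarised by the sketch below: rentable trips are the smaller of supply and demand, path flows are rounded-down shares of those trips, and the supply is rolled forward by attraction minus departures. Variable names and the nested-dictionary layout are illustrative assumptions.

import math

def update_environment_tr0(supply, demand, flow_ratio, units):
    """One tr=0 step: rider trips, attractions, and the resulting supply (illustrative)."""
    available = {u: min(supply[u], demand[u]) for u in units}            # rentable trips
    path_flow = {(o, d): math.floor(available[o] * flow_ratio[o][d])     # OD path flows
                 for o in units for d in units}
    trips = {o: sum(path_flow[(o, d)] for d in units) for o in units}        # actual travel
    attraction = {d: sum(path_flow[(o, d)] for o in units) for d in units}   # actual attraction
    new_supply = {u: supply[u] - trips[u] + attraction[u] for u in units}
    return trips, attraction, new_supply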
When the policy execution state variable tr = 0 at time step t, the unit tag variable that the dispatching vehicle will arrive at in time step (t+1) is computed from the starting unit tag variable of the dispatching vehicle at time step (t+1) and the movement direction variable of the dispatching vehicle from η_i,0 to one of the six adjacent regular hexagons (the formulas are given as images in the original publication), where m denotes the horizontal direction tag variable of the scheduling area unit and h denotes the vertical direction tag variable of the scheduling area unit.
When the policy execution state variable tr = 0 at time step t, the estimated cumulative increase/decrease amount of the supply of η_5 is computed according to the formula given as an image in the original publication, where one symbol denotes the number of shared bicycles that the (i-1)-th dispatching vehicle is expected to pick up from η_5, α_wh denotes the ratio of the number of shared bicycles placed at η_i,1 to the number in the cabin when the dispatching vehicle arrives at η_i,1 and η_i,1 belongs to η_w, and η_w denotes the fixed warehouse location set.
When the policy execution state tr = 0 at time step t, assuming the first (i-1) dispatching vehicles have implemented their scheduling, the estimated cumulative increase/decrease amount of the supply of unit η_5 behaves as follows. If the proposed scheduling policy of the (i-1)-th dispatching vehicle does not involve unit η_5, the estimated cumulative increase amount is 0. If the (i-1)-th dispatching vehicle is expected to pick up a given number of bicycles from unit η_5, the estimated cumulative increase amount of the supply of η_5 decreases by that number. If the (i-1)-th dispatching vehicle is expected to place a given number of bicycles into unit η_5 and the tag value of unit η_5 does not belong to the city fixed warehouse location set η_w, the estimated cumulative increase amount of the supply of η_5 increases by that number. If the (i-1)-th dispatching vehicle is expected to place bicycles into unit η_5 and the tag value of unit η_5 belongs to η_w, the estimated cumulative increase amount increases by the amount given in the original formula image, which involves the ratio α_wh because part of the placed bicycles enters the city fixed warehouse. When the city fixed warehouse location set η_w is empty, the case of the city fixed warehouse is disregarded by default.
When the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up a certain number of shared bicycles at η_i,0 and loads them into its cabin, and then places all of the shared bicycles in the cabin at η_i,1. The number of shared bicycles picked up by the dispatching vehicle is computed according to the formula given as an image in the original publication, where min(·) denotes taking the minimum, the supply symbol denotes the supply when the policy execution state variable tr = 0, η_i,0 denotes the starting unit tag variable of the dispatching vehicle, the capacity symbol denotes the maximum capacity of the cabin of the dispatching vehicle, and the ratio symbol denotes the scheduling ratio variable of the dispatching vehicle.
The scheduling ratio indicates that the dispatching vehicle is to pick up the given percentage of the current supply of its current unit η_i,0; the supply referred to is the remaining supply of η_i,0 after the first (i-1) dispatching vehicles have implemented their scheduling before time step t in state tr = 0. The number of picked-up shared bicycles is therefore the rounded-down number of bicycles that should be picked up according to the scheduling ratio, the remaining supply, and the cabin capacity, and it is an integer; the formula as a whole is also subject to a non-negativity constraint.
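A minimal sketch of the pick-up rule described above, assuming the picked-up number is the rounded-down scheduling-ratio share of the remaining supply, capped by the cabin capacity and kept non-negative; the function and parameter names are illustrative.

import math

def pickup_count(remaining_supply: int, schedule_ratio: float, cabin_capacity: int) -> int:
    """Number of shared bicycles picked up at the starting unit eta_{i,0}."""
    wanted = math.floor(schedule_ratio * remaining_supply)   # ratio of the current supply
    return max(0, min(wanted, cabin_capacity))               # capped, non-negative integer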
When the policy execution state variable tr = 1 at time step t, the scheduling policy is executed according to the number of shared bicycles picked up by the dispatching vehicle, and η_5 is updated to obtain the shared bicycle supply variable of η_5 after the scheduling policy has been implemented (the formula is given as an image in the original publication).
The formula expresses the supply of unit η_5 after dispatching vehicle i implements its scheduling at time step t in state tr = 1. If the scheduling policy of dispatching vehicle i does not involve unit η_5, the supply of η_5 remains unchanged. If dispatching vehicle i picks up a given number of bicycles from unit η_5, the supply of η_5 decreases by that number. If dispatching vehicle i places a given number of bicycles into unit η_5 and the tag value of unit η_5 does not belong to the city fixed warehouse location set η_w, the supply of η_5 increases by that number. Conversely, when the tag value of unit η_5 belongs to η_w, the supply of η_5 increases only by the portion determined by the ratio α_wh, and the remaining shared bicycles are placed into the city fixed warehouse of unit η_5 by default.
The total number Z_warehouse of shared bicycles stored in the city fixed warehouse is computed accordingly (the formula is given as an image in the original publication).
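The tr = 1 update and the warehouse bookkeeping described above can be sketched as follows. The split of placed bicycles between the unit supply and the warehouse via α_wh is written out under the stated assumption that α_wh is the fraction left in the unit, with the remainder counted toward Z_warehouse; the exact split in the patent's formula image may differ.

def apply_schedule(supply, warehouse_total, origin, destination, picked,
                   alpha_wh, warehouse_units):
    """Move `picked` bicycles from origin to destination (illustrative tr=1 update).

    Assumption: when the destination is a warehouse unit, a fraction alpha_wh of
    the cabin stays in the unit's supply and the rest enters the fixed warehouse.
    """
    supply[origin] -= picked
    if destination in warehouse_units:
        kept = int(alpha_wh * picked)
        supply[destination] += kept
        warehouse_total += picked - kept
    else:
        supply[destination] += picked
    return supply, warehouse_total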
The shared bicycle scheduling optimization problem rests on two assumptions. Assumption one: each dispatching vehicle implements its scheduling policy in turn, in the order of the dispatching vehicle numbers. Riders rent shared bicycles according to the current supply of shared bicycles in their current unit; the decision maker formulates a scheduling policy based on the supply and demand environment after the trips are completed, then implements the policy and updates the supply and demand environment. The policy execution state variable tr divides each time step into two states: a state in which rider trips are updated and the scheduling policy is formulated, and a state in which the scheduling policy is implemented. When tr = 0, riders rent, use and park shared bicycles, the supply and demand change of every unit at time step t is updated, and a scheduling policy is then generated based on the post-trip supply and demand; when tr = 1, the scheduling policy is implemented and the supply and demand environment under its influence is updated. Assumption two: to guarantee that a dispatching vehicle never travels outside the scheduling area, it is assumed that the vehicle stays at its current location whenever this would occur. That is, if the scheduling policy would send a dispatching vehicle to a unit outside the area at time step t+1, the policy is updated so that the unit the vehicle reaches at time step t+1 is the unit it occupies at time step t. Under this assumption, the arrival unit at time step t+1 and the current unit at time step t must simultaneously satisfy the corresponding boundary constraints of the scheduling policy.
In the shared bicycle scheduling optimization problem, the length of the scheduling cycle is controlled by the choice of the time step set. The invention sets the scheduling cycle of the short-term scheduling optimization problem to one day, i.e. T_max = 143 and the time step set T = {0, 1, ..., 143}. In conventional scheduling methods the scheduling cycle is typically one day. In practice, however, because the number of dispatching vehicles is limited, the uneven distribution of shared bicycles and the loss of demand become more serious as time goes on. Especially in the later stages of the operating process, formulating an effective policy is more challenging because the distribution of shared bicycles is more unbalanced. To address long-term operational scheduling, the invention also considers a long-cycle dynamic scheduling optimization problem for shared bicycles, whose scheduling cycle is defined as seven days, i.e. T_max = 1007 and the time step set T = {0, 1, ..., 1007}.
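As a concrete illustration of the two horizons, the time step sets can be built as follows; the ten-minute step length is an inference from 144 steps per day and is not stated explicitly above.

```python
# Time step sets for the short-term (1 day) and long-term (7 days) problems,
# assuming one time step corresponds to ten minutes (144 steps per day).
STEPS_PER_DAY = 144

T_short = list(range(STEPS_PER_DAY))      # T = {0, 1, ..., 143}, T_max = 143
T_long = list(range(7 * STEPS_PER_DAY))   # T = {0, 1, ..., 1007}, T_max = 1007
```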
On the basis of increasing shared bicycle trips as much as possible, the invention further aims to reduce the number of shared bicycles that sit idle on urban roads for long periods. The invention therefore formulates a dynamic scheduling optimization problem for shared bicycles that includes city warehouses. In this problem it is assumed that fixed warehouses exist in the city and can store surplus bicycles. During a scheduling operation, a dispatching vehicle may move to a unit in which a warehouse is built and place the shared bicycles stored in its cabin into the city warehouse. When the position set ηw of the city fixed warehouses is empty, the shared bicycle scheduling optimization problem reduces to the case without city fixed warehouses, i.e. dispatching vehicles cannot deposit surplus shared bicycles into a city fixed warehouse during scheduling. Conversely, when ηw is not empty, the problem by default assumes that city fixed warehouses exist and that surplus idle shared bicycles can be stored in them.
In the shared bicycle scheduling optimization problem, the policy execution state variable tr ∈ {0, 1}, the dispatching vehicle tag set I = {0, 1, ..., N}, the movement direction variable set of a dispatching vehicle κ1 = {0, 1, ..., 5}, the scheduling ratio variable set κ2 = {0, 0.25, 0.5, 0.75}, and the unit tag set M' = {0, 1, ..., ((M+1)² − 1)}. According to the variable definitions of the shared bicycle short-term scheduling optimization problem, and taking the implementation of the scheduling policy into account, the constraints on the actual unit trip production and the actual unit trip attraction guarantee the conservation of the shared bicycle trip flow of every unit.
In the constructed short-term scheduling optimization problem of the shared bicycle, the objective function maximizes the total shared bicycle trips added in the area traversed by the dispatching vehicles, compared with the case in which no scheduling policy is executed. The decision variables are the action decisions of the dispatching vehicles, namely the movement direction towards a unit and the number of bicycles to be scheduled. The constraints are the conservation of the total number of shared bicycles, the conservation relation between riding path flows and riding OD flows, and non-negativity and integrality constraints on the flows in the scheduling process. When the travel demand generated in a unit exceeds the number of shared bicycles available in that unit, the excess demand is treated as lost demand.
In an embodiment of the present invention, step S4 comprises the sub-steps of:
s41: determining elements of a shared bicycle scheduling framework based on a vehicle scheduling optimization model of the shared bicycle;
s42: determining average actions by using a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle scheduling framework;
s44: based on the average field theory, the shared bicycle scheduling frame is constructed according to the elements, the average actions, the experience pool variables and the training round related variables of the shared bicycle scheduling frame.
Based on the proposed shared bicycle scheduling optimization problem, the invention provides a shared bicycle scheduling framework built on multi-agent reinforcement learning with the average field (mean field) theory. The goal is to enable the agents to learn the changing riding demand, adapt to a stochastic dynamic environment, realize cooperative dynamic decision optimization, and increase riding trips.
In the embodiment of the present invention, in step S41, the present invention combines the shared bicycle transfer process model and the multi-agent reinforcement learning algorithm to construct a vehicle dispatching model for the shared bicycle. The invention defines I as the agent tag set, which is equivalent to the tag set of the dispatching vehicles; S is the state set; A_i represents the action space of agent i; P is the transition probability function; R is the reward function; and γ is the discount factor. The MDP-based reinforcement learning model therefore includes six elements: G = (I, S, A, P, R, γ), where i ∈ I = {0, 1, ..., N} is the tag variable of a dispatching vehicle and, equivalently, of an agent in the reinforcement learning algorithm.
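The six elements can be collected in a small container type, sketched below with placeholder types; the default discount factor value is illustrative only and is not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SchedulingMDP:
    """G = (I, S, A, P, R, gamma) for the shared bicycle dispatching problem (illustrative sketch)."""
    agents: Sequence[int]            # I: dispatching vehicle / agent tags {0, 1, ..., N}
    states: object                   # S: joint supply-and-position state space
    action_spaces: Sequence[object]  # A_i: per-agent action space (direction x ratio)
    transition: Callable             # P: state transition probability function
    reward: Callable                 # R: reward function (PA, APA or APTU variant)
    gamma: float = 0.95              # discount factor (illustrative default value)
```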
The elements of the shared bicycle scheduling framework include the state, the behavior parameter a_t and the reward function, where s_t^i represents the state of dispatching vehicle i at time step variable t and ρ_t^i represents the scheduling policy of dispatching vehicle i at time step variable t.
The reward function includes the actually-increased trip reward function of a dispatching vehicle, the average increased trip reward function of a dispatching vehicle, and the globally increased trip reward function of the dispatching vehicles, whose values are computed from the following quantities: α_rw denotes the scaling coefficient of the reward function; the actual trip amounts of units η_{i,0} and η_{i,1} at time step variable t are taken both when the scheduling policy is implemented and when it is not; the numbers of dispatching vehicles within η_{i,0} and η_{i,1} at time step variable t are used to average the increase; the actual trip amount of unit η5 when the scheduling policy is implemented and when it is not enters the global reward; N denotes the maximum value of the dispatching vehicle tag variable; η5 denotes the global tag of each scheduling area unit; and M' denotes the unit tag set of the scheduling area units.
In the state definition, s_t^i represents the state of dispatching vehicle i at time step variable t; the invention assumes that s_t^i comprises the supply amount of the unit in which agent i is located and the position number of that unit.
The behavior parameter refers to the joint action of the scheduling policies of the dispatching vehicles at time step t and satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, where A is the set of vectors formed by the action spaces A_i of the individual agents. The action policy of agent i is identical to the scheduling policy of the corresponding dispatching vehicle; agent i refers to each dispatching vehicle tag in the city. r_t^i is the instant evaluation, given by the environment, of the state and the generated action during the interaction of agent i with the environment, and the goal of agent i is to maximize its reward. The present invention considers three forms of reward function.
The reward function is the instant evaluation, given by the environment, of the state and the generated action during the interaction of agent i with the environment; the goal of agent i is to maximize the reward value. Based on the shared bicycle scheduling problem, the variables used in the calculation of the reward functions are as follows: α_rw is the dimensionless scaling coefficient of the reward function; the actual trip amounts of units η_{i,0} and η_{i,1} at time step t (dimensionless) are evaluated with and without the scheduling policy; and the numbers of dispatching vehicles within units η_{i,0} and η_{i,1} at time step t (dimensionless) are used for averaging. In the shared bicycle scheduling problem, the invention considers the following three types of selectable reward function.
(1) Increased trip reward obtained by the agent: the invention defines the Increased Trip Production Obtained by Agent (PA) reward function, referred to as the PA reward function. It represents the increase in shared bicycle trips obtained by each agent after that agent performs its action. In the PA reward function, the rewards within the units that an agent moves through are all counted as rewards of that agent; this setting may lead an agent to focus on the scheduling of particular units.
(2) Average increased trip reward obtained by the agent: the invention defines the Average Increased Trip Production Obtained by Agent (APA) reward function, referred to as the APA reward function. It is the average increase in shared bicycle trips obtained by each agent after the agent performs its action, defined as the average trip increase produced in the units η_{i,0} and η_{i,1} traversed by the dispatching vehicle when the scheduling policy is executed.
(3) Globally increased trip reward obtained by the agent: the invention defines the Average Increased Trip Production Obtained by an Agent of Total Units (APTU) reward function, referred to as the APTU reward function. It is the area-wide increase in shared bicycle trips obtained by all agents after the joint action is performed.
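A minimal sketch of the three reward variants, mirroring only their verbal description above (increase in trips in the units the vehicle passes through; the same increase averaged over the dispatching vehicles present in those units; the area-wide increase shared by all agents). All identifiers are illustrative, the tanh squashing anticipates the reward processing described later in the framework, and dividing the APTU gain by the number of agents is an assumption.

```python
import math

def pa_reward(alpha_rw, units_visited, trips_with, trips_without):
    """PA: increased trips obtained by one agent in the units it passes through."""
    gain = sum(trips_with[u] - trips_without[u] for u in units_visited)
    return math.tanh(alpha_rw * gain)   # scaled then squashed, per the framework description

def apa_reward(alpha_rw, units_visited, trips_with, trips_without, vehicles_in_unit):
    """APA: the same increase, averaged over the dispatching vehicles present in each unit."""
    gain = sum((trips_with[u] - trips_without[u]) / max(vehicles_in_unit[u], 1)
               for u in units_visited)
    return math.tanh(alpha_rw * gain)

def aptu_reward(alpha_rw, all_units, trips_with, trips_without, n_agents):
    """APTU: area-wide increase in trips, shared by all agents (equal split assumed)."""
    gain = sum(trips_with[u] - trips_without[u] for u in all_units)
    return math.tanh(alpha_rw * gain / n_agents)
```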
The state transition probability means that, as the time step advances, the state of each agent is updated according to the joint action performed by the agents and their interaction with the environment.
In the embodiment of the present invention, in step S42, the average action is determined as follows: the scheduling policy ρ_t^i of each dispatching vehicle is rewritten in one-hot coded form, and the average action of agent i is the arithmetic mean of the one-hot action vectors of the remaining dispatching vehicles. Here each component of the one-hot vector is a variable equal to 0 or 1, ρ_dim denotes the dimension of the scheduling policy, N denotes the maximum value of the dispatching vehicle tag variable, and i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, whose action policy enters the average.
Concretely, the invention rewrites the movement direction and the scheduling ratio of a dispatching vehicle as a one-hot action vector, and the average action of agent i is the average of the action vectors of the remaining agents.
In a traditional multi-agent deep reinforcement learning algorithm the joint action a_t satisfies a_t ∈ A = A_0 × A_1 × ... × A_N, and its dimension is (N+1)·ρ_dim, which expands as the number of agents increases and leads to greater network complexity, lower computational efficiency and weaker policy optimization in reinforcement learning.
In the MFMARL algorithm, by contrast, the dimension of the joint average action is ρ_dim, the dimension of an individual agent's action is ρ_dim, and the combined (action, average action) input has dimension 2·ρ_dim. Processing the joint action in the mean field (MF) manner therefore keeps the dimension of the joint action under control and preserves computational efficiency. In particular, in complex simulation environments the number of agents is typically large, and using the joint average action based on MF theory alleviates the dimensional explosion of the joint action caused by the growing number of agents.
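A sketch of the one-hot rewriting and the mean-field average action, assuming ρ_dim is the concatenation of a 6-way direction one-hot and a 4-way ratio one-hot (so ρ_dim = 10 under this assumption):

```python
import numpy as np

DIRECTIONS = 6   # kappa_1 = {0, 1, ..., 5}
RATIOS = 4       # kappa_2 = {0, 0.25, 0.5, 0.75}

def one_hot_action(direction_idx, ratio_idx):
    """Rewrite a scheduling action (direction, ratio) as a concatenated one-hot vector."""
    a = np.zeros(DIRECTIONS + RATIOS)
    a[direction_idx] = 1.0
    a[DIRECTIONS + ratio_idx] = 1.0
    return a

def mean_action(actions, agent_i):
    """Mean-field average of the one-hot actions of all agents other than agent_i."""
    others = [a for j, a in enumerate(actions) if j != agent_i]
    return np.mean(others, axis=0)
```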
In an embodiment of the present invention, in step S43, the experience pool variables of the shared bicycle scheduling framework include the experience pool and the experience pool capacity; the training-round related variables include the number of training rounds Episode, the target network update interval Episode_upnet, the weight coefficient ω used for target network updates, and the cumulative return discount factor γ.
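A minimal sketch of the experience pool and the training-round variables; the numeric values are illustrative defaults, not settings taken from the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool for the scheduling framework (illustrative sketch)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition = (state, action, mean_action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        data = list(self.buffer)
        return random.sample(data, min(batch_size, len(data)))

# Training-round related variables (values are illustrative defaults).
EPISODES = 1000           # Episode: number of training rounds
EPISODES_UPDATE_NET = 10  # Episode_upnet: interval for transferring weights to target networks
OMEGA = 0.01              # weight coefficient for target network updates
GAMMA = 0.95              # cumulative return discount factor
```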
In an embodiment of the present invention, step S44 includes the following sub-steps:
S441: initializing the experience pool; setting the experience pool capacity, the weight coefficient ω for target network updates, the reward function scaling coefficient α_rw and the cumulative return discount factor γ; setting the initial given supply, the shared bicycle travel demand variable and the trip flow ratio of shared bicycles departing from η2 and arriving at η3; and performing steps S442-S445 in a loop according to the number of training rounds Episode and the update interval Episode_upnet;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state s_t^i and the scheduling policy ρ_t^i of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state of each dispatching vehicle, the average actions, and the increased benefit of each dispatching vehicle after the scheduling policy is implemented, according to the reward function;
S446: based on the update process of steps S442-S445, constructing the shared bicycle scheduling framework using the reinforcement learning algorithm and completing the shared bicycle scheduling with it, as summarized in the sketch below.
Improving the stability of the multi-agent environment: during training in a multi-agent environment, the policy of every agent changes continuously. From the point of view of agent i, the policies of the other agents keep changing, so agent i finds itself in a non-stationary environment, and the transition probability from the state at one time to the state at the next time cannot be guaranteed to be a stable value. That is, for any pair of policies of agent i at times t1 and t2 with π_{t1}^i ≠ π_{t2}^i, the non-stationary environment may satisfy
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) ≠ P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N),
where π_{t1}^i and π_{t2}^i denote the policies of agent i at times t1 and t2, and the two sides denote the state transition probabilities of agent i at times t1 and t2, respectively.
If agent i knows the action content of all agents during reinforcement learning, the environment of agent i becomes stationary. Since a policy can be expressed through actions and states, when the states and actions of all agents are known, the state transition probabilities of agent i at times t1 and t2 respectively satisfy
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N),
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N).
Both transition probabilities can therefore be regarded as policy-independent. Even while the agents' policies keep changing, the transition probability from the state at one time to the state at the next time remains stationary, i.e.
P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N).
Thus, for any given joint (centralized) action, the environment of agent i can be improved into a stationary environment:
P(s_{t+1} | s_t, a_t^1, ..., a_t^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t1}^1, ..., π_{t1}^N) = P(s_{t+1} | s_t, a_t^1, ..., a_t^N, π_{t2}^1, ..., π_{t2}^N).
In the embodiment of the present invention, in step S442 the shared bicycle operating environment under policy execution state variable tr = 0 is updated by recomputing, for each OD tag variable (η2, η3) of shared bicycle travel, the shared bicycle path flow, the actual shared bicycle trip production variable of unit η5, the actual shared bicycle trip attraction variable of unit η5, and the supply amount of each unit when tr = 0. In step S444, the shared bicycle operating environment under policy execution state variable tr = 1 is updated by recomputing, from the behavior parameter a_t, the number of shared bicycles that each dispatching vehicle picks up from η_{i,0} and places when it reaches η_{i,1}, and the supply amount of each unit when tr = 1.
In step S446, the reinforcement learning algorithm adopts either a policy gradient method or a Q-Learning method.
Each dispatching vehicle comprises a policy model and a value model. In the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network. The policy estimation network is a neural network whose input is the state s_t^i of the dispatching vehicle and whose output is the scheduling policy ρ_t^i; the policy target network is a neural network whose input is the state of the dispatching vehicle at the next time step and whose output is the scheduling policy of the next time step.
In both the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network. The value estimation network is a neural network whose inputs are the state of the dispatching vehicle, its scheduling policy and the average action, and whose output is the Q value function Q_i, where the Q value function is the state-action value function of the reinforcement learning algorithm and represents the cumulative reward obtained by the dispatching vehicle. The value target network is a neural network whose inputs are the state, scheduling policy and average action of the next time step, and whose output is the target Q value function. The estimation network and the target network of the value model run in forward-propagation mode when computing Q_i and the target Q value, and in back-propagation mode when updating parameters; the policy model is computed in the same way.
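A minimal PyTorch sketch of one dispatching vehicle's policy (estimation) network and value (estimation) network under the policy gradient variant; the layer sizes and activations are assumptions, and the target networks would be structurally identical copies.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the dispatching vehicle's state to a distribution over scheduling actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Maps (state, own action, mean action) to the Q value used by the mean-field critic."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, mean_action):
        x = torch.cat([state, action, mean_action], dim=-1)
        return self.net(x)
```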
In the Q-Learning method, the policy model of a dispatching vehicle samples its action probabilistically from a Boltzmann distribution over the Q values evaluated with the average action of the previous time step: the probability of each candidate action in the action space A_i is computed from Q_i with the policy parameter ω_d, and a provisional action is drawn; the average action is then updated with the newly selected actions, substituted back into the action probability calculation, and a second probabilistic sampling yields the action that is finally chosen by the policy model of the dispatching vehicle. Here i_ne denotes the tag variable of a dispatching vehicle other than dispatching vehicle i, and the average action of the previous time step is the mean of the other agents' actions at time step t−1.
For the Q-Learning-based reinforcement learning algorithm, the policy model of each agent derives its action directly from the Q_i value of agent i and contains no policy target model. The value model of each agent is split into a value estimation network and a value target network with the same structure as the value estimation network. The value estimation network takes as input the global state s_t, the action of the previous time step and the average action of the previous time step, and outputs the Q_i value. The input layer of the value target network takes the global state s_{t+1} of the next time step, the action value and the average action value, and its output is the target Q value. The estimation network and the target network of the value model run in forward-propagation mode when computing Q_i and the target Q value, and in back-propagation mode when updating parameters.
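A sketch of the Boltzmann (probability) sampling used by the Q-Learning policy model, with the two-pass selection simplified so that the refreshed mean action is stood in by the provisionally selected action; `q_fn`, `omega_d` and the other names are illustrative, not the patent's symbols.

```python
import numpy as np

def boltzmann_sample(q_values, omega_d, rng=None):
    """Sample an action index with probability proportional to exp(omega_d * Q)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = omega_d * np.asarray(q_values, dtype=float)
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def mf_q_action(q_fn, state, prev_mean_action, omega_d, action_space):
    """Two-pass mean-field Q action selection: sample, refresh the mean action, sample again."""
    q1 = [q_fn(state, a, prev_mean_action) for a in action_space]
    first = boltzmann_sample(q1, omega_d)
    new_mean_action = action_space[first]   # simplified stand-in for the updated mean action
    q2 = [q_fn(state, a, new_mean_action) for a in action_space]
    return action_space[boltzmann_sample(q2, omega_d)]
```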
In the framework, each agent independently contains a policy model used for decentralized action execution and a value model that contains centralized state-action information. The reinforcement learning algorithm can be of two types: the policy gradient method and the Q-Learning method.
If the reinforcement learning algorithm adopts the policy gradient method, the transition consisting of the global state, the joint action, the average action, the reward of the dispatching vehicle and the global state of the next time step is stored into the experience pool, and a batch of samples is randomly drawn from the experience pool. According to the sampled batch, the neural network parameters of the value estimation network are updated from the cumulative return discount factor γ and the loss function, and the neural network parameters of the policy model are updated by gradient descent. Every Episode_upnet training rounds, the parameters θ_i of the policy model neural network and the parameters of the value model neural network are transferred, according to the target network update weight coefficient ω, to the neural network parameters of the corresponding policy target network and value target network, respectively. In the stored transitions, s_t denotes the global state and s_{t+1} the global state of the next time step; in the sampled batch, s_{t,j} denotes the global state of sample j, s_{t+1,j} the next-time-step global state of sample j, and the sampled policies, average actions and dispatching vehicle reward values are defined analogously.
If the reinforcement learning algorithm adopts the Q-Learning method, the transition consisting of the global state, the action, the average action, the reward and the global state of the next time step is stored into the experience pool, and a batch of samples is randomly drawn from the experience pool. On the sampled batch, the neural network parameters of the value estimation network are updated according to the cumulative return discount factor γ and the loss function. Every Episode_upnet training rounds, the neural network parameters of the value model are transferred to the neural network parameters of the value target network according to the target network update weight coefficient ω.
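A sketch of the periodic target-network update; the text above only says the parameters are transferred "according to the weight coefficient ω", so the ω-weighted (soft) blend below, applied every Episode_upnet rounds, is an assumption about the exact form.

```python
import torch

@torch.no_grad()
def update_target_network(online_net, target_net, omega):
    """Blend online parameters into the target network with weight omega (soft update)."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - omega).add_(omega * p_online)

# Typical usage inside the training loop, every Episode_upnet rounds:
#   if episode % EPISODES_UPDATE_NET == 0:
#       update_target_network(value_net, value_target_net, OMEGA)
```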
According to the embodiment of the invention, a shared bicycle scheduling framework based on multi-agent reinforcement learning with the average field theory is provided for the proposed shared bicycle scheduling optimization problem. The goal is to enable the agents to learn the changing riding demand, adapt to a stochastic dynamic environment, realize cooperative dynamic decision optimization and increase riding trips. The basic ideas behind the construction of the framework are as follows:
(1) Feasibility of solving shared bicycle scheduling problem by using reinforcement learning algorithm framework
In the constructed shared bicycle scheduling optimization problem, only the supply state at the current time needs to be known; no historical information from earlier times is required. That is, the supply state at the current time depends only on the supply state at the previous time and the policy action executed, and is independent of the supply states and decision actions at other times. The supply state of the shared bicycle scheduling optimization problem therefore possesses the Markov property: the problem satisfies the Markov assumption, i.e. the current state contains all relevant information.
Thus the shared bicycle scheduling problem can be converted into a Markov decision process. A Markov decision process can be solved within a reinforcement learning framework, so solving the shared bicycle scheduling optimization problem with a reinforcement learning framework is feasible. Reinforcement learning needs no labeled data and can learn, in a model-free way, the high-dimensional mapping from states to actions. In view of these advantages, the invention proposes a shared bicycle scheduling optimization framework based on a reinforcement learning algorithm.
(2) Problems arising when solving the shared bicycle scheduling problem with a reinforcement learning framework
When a multi-agent reinforcement learning algorithm is used to solve the shared bicycle scheduling control problem, issues such as a degraded learning effect and repeated scheduling of the same area by several dispatching vehicles can occur, as discussed below.
First, when a multi-agent setting is imposed on a traditional reinforcement learning algorithm, the policies of the other agents keep changing from the viewpoint of each agent, which results in a non-stationary environment. A non-stationary environment violates the stationarity of Markov state transitions, causes policy estimation errors, and reduces policy optimization efficiency or makes policy optimization fail.
In the DQN algorithm, agent i learns and selects its best policy through an independent Q-learning procedure. In a multi-agent environment, however, agent i updates its policy independently during learning without considering the policies of the other agents, so the environment to which any agent i belongs is non-stationary: at different times t the state transition probability is not necessarily a stable value. Yet the convergence proof of the Q-learning algorithm requires the state transition probability matrix to be stationary, and the non-stationary environment contradicts this assumption. Second, in the traditional DQN algorithm, agent i randomly samples data from the experience pool and the sampled batch is used as training data for the neural network; this avoids the problem of state correlation in the raw data. In a multi-agent environment, however, a policy that agent i optimizes under the current state may be invalid in the next state of the non-stationary environment. Learning from samples that contain invalid policies is inefficient and increases the likelihood that experience replay fails. Therefore, in a multi-agent environment, the traditional DQN algorithm cannot guarantee that the value function converges to the optimal value function.
In a policy gradient algorithm operating in a non-stationary environment, the variance produced by the algorithm grows as the number of agents increases, and the probability that the policy is optimized in the correct direction decreases.
Second, in the multi-agent deep reinforcement learning problem, agents are affected by the environment and other agents. During learning, the dimension of the joint action of the state-action value function will exponentially expand as the number of agents increases. Each agent estimates its own value function according to the joint strategy, and when the joint action space is large, the learning efficiency and learning effect will be reduced.
Third, if the estimation network of the reinforcement learning algorithm has low sensitivity to the input data and the agent is dominated by a particularly high reward or penalty value, the agent over-weights the parameters corresponding to that state and action. This further reduces the agent's sensitivity to changes in state and action information and leads it to select the same action repeatedly. For a single agent, an overly uniform action selection increases the monotony of the learning samples and degrades the fitting accuracy of the neural network in the estimation network. An inaccurate estimate of the dispatching vehicle policy by the reinforcement learning algorithm then lowers scheduling efficiency.
Fourth, in a multi-agent reinforcement learning algorithm, the joint action may send several dispatching vehicles to move shared bicycles within the same unit. Such non-cooperative scheduling policies reduce scheduling efficiency and can even over-schedule certain units, leaving no bicycles available for rent or piling up an excess of shared bicycles.
(3) Basic idea of frame construction
In view of these problems, the shared bicycle scheduling optimization framework based on multi-agent deep reinforcement learning must, in a stationary environment, address the dimensional explosion caused by the number of agents and achieve efficient cooperation among the agents. The framework is constructed along the following lines:
First, the agent learning structure of the framework:
In multi-agent deep reinforcement learning methods with a distributed structure, the algorithms can be classified into group reinforcement learning and independent reinforcement learning according to whether an agent considers the state and behavior information of the other agents.
If each agent in the multi-agent system is treated as an independent single agent without communication capability, i.e. it ignores the policy choices of the other agents when selecting its own policy, the algorithm is of the independent learning type. In this case, shared information can only be obtained among the agents through collective communication after feedback from the external environment. Conversely, in group learning algorithms, the multiple agents are treated as a combined group and each agent also considers the policy choices of the other agents during learning.
Independent learning avoids the dimensional explosion in communication caused by a growing number of agents and can reuse reinforcement learning algorithms designed for static environments, but it converges slowly and requires long learning times. Group learning allows the agents to communicate fully and achieve complete cooperation, but its search space is large and learning also takes long. To realize cooperation and communication among the agents, the policies of the other agents are taken into account during learning, and a group reinforcement learning method is adopted to construct the multi-agent deep reinforcement learning algorithm.
Second, improving the non-stationary environment in the framework:
If agent i learns the action content of all agents during learning, the environment of agent i becomes stationary. The states and action contents of the agents are therefore treated as known information here in order to improve the non-stationary environment.
Third, mitigating the dimensional explosion caused by the growing number of agents in the framework:
Mean field game theory (Mean Field Theory, MFT) studies differential games within populations of rational players. Each agent considers its own state while still taking the states of the other agents into account. A classic illustration of mean field games is a school of fish swimming cooperatively: an individual fish does not track the swimming behavior of every other fish, but adjusts its own behavior according to the behavior of the school in its neighborhood. Mean field game theory describes the behavioral response of surrounding agents and the behavior set of all agents through the Hamilton-Jacobi-Bellman equation and the Fokker-Planck-Kolmogorov equation. The Mean Field Multi-Agent Reinforcement Learning (MFMARL) algorithm, built on mean field game theory, assumes that the influence of all other agents on a given agent can be represented by an average distribution. MFMARL is suited to reinforcement learning with large numbers of agents, simplifies the interaction computations among them, and resolves the expansion of the value function space caused by the growing number of agents. MFMARL is therefore incorporated into the shared bicycle scheduling framework here, and every agent is defined to have the same discrete action space.
Fourth, improving the sensitivity of the reinforcement learning algorithm to changes in state and action information:
To improve learning stability in the multi-agent deep reinforcement learning framework and increase the sensitivity of the neural network to changes in state and action information, the framework feeds one-hot coded inputs to the neural network, and the reward function value is first scaled and then passed through the hyperbolic tangent function tanh(·).
Fifth, improving efficient cooperation among the agents:
To improve efficient cooperation among the agents, the framework makes the action content of all agents known during the learning process of agent i, so that the states and policies of the other agents can be learned. In addition, different forms of reward function are discussed and their impact on cooperation is studied.
Therefore, aiming at the shared bicycle scheduling control problem, the invention improves the stability of the multi-agent learning environment and constructs a shared bicycle scheduling framework of multi-agent reinforcement learning based on the average field theory with multi-agent group learning.
The working principle and the working process of the invention are as follows: the invention aims to establish a general framework for shared bicycle scheduling based on multi-agent deep reinforcement learning of average field theory, so as to solve the shared bicycle scheduling problems of long-term scheduling process, dynamic environment and large-scale network. The stability of state transition of the multi-agent deep reinforcement learning algorithm, dimensional explosion, agent communication efficiency and agent exploration behaviors are considered. And a framework of a reinforcement learning algorithm is adopted, and a coordinated and effective dynamic strategy is obtained in a high-dimensional action space, so that travel requirements are met, and idle shared bicycles in a road are reduced. And combining reinforcement learning basic theory and shared bicycle scheduling system research, defining the division of area units, and constructing a shared bicycle scheduling optimization model.
For the high-dimensional multi-agent action space, a shared bicycle scheduling framework based on multi-agent deep reinforcement learning with the average field theory is proposed. The framework can address long-term scheduling, dynamic environments, and large-scale, complex networks. It requires neither advance demand prediction nor manual data processing, and is therefore unaffected by the computational efficiency and accuracy of demand forecasting. Moreover, the framework does not seek the best strategy for each individual time period; it is an overall optimization of the entire scheduling process that takes into account the supply and demand changes of future periods and the influence of scheduling decisions on the supply and demand of the next period.
The beneficial effects of the invention are as follows:
(1) The shared bicycle scheduling optimization method based on reinforcement learning is beneficial to intelligently solving the problem of short-term and long-term scheduling optimization of shared bicycles of a large-scale road network under random and complex dynamic environments. The method does not need to predict the demand in advance or conduct manual data processing, and is not affected by the calculation efficiency and accuracy of demand prediction. And this method is not the optimal strategy for each time period, but is an overall optimization method for the entire scheduling process, which takes into account the supply and demand changes of the future time period and the influence of the scheduling decision on the supply and demand of the next time period.
(2) The dynamic optimization scheduling strategy provided by the invention can improve scheduling operation efficiency. It increases the trip volume and utilization rate of shared bicycles and reduces the lost demand of shared bicycle users. It lowers the idle rate of shared bicycles on the road and reduces the number of idle bicycles piled up excessively in certain areas, thereby reducing the waste of shared resources and alleviating the degradation of the urban environment caused by large numbers of stacked idle bicycles.
(3) Increasing the actual trips of shared bicycle users raises the share of shared bicycles in connecting (feeder) trips and improves the operating efficiency of the public transport system. Better service quality encourages shared bicycles to replace motor vehicle trips, reducing urban congestion and motor vehicle exhaust emissions and increasing social welfare.
Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (6)

1. The shared bicycle scheduling method based on deep reinforcement learning is characterized by comprising the following steps of:
s1: dividing a dispatching area of the shared bicycle to obtain a dispatching area unit, and determining a running environment variable of the shared bicycle;
s2: determining a scheduling variable of the shared bicycle according to the running environment variable of the shared bicycle based on the scheduling area unit;
s3: constructing a vehicle dispatching optimization model of the shared bicycle according to dispatching variables of the shared bicycle;
s4: based on a vehicle dispatching optimization model of the shared bicycle, constructing a shared bicycle dispatching frame by utilizing an average field theory, and completing shared bicycle dispatching by utilizing the shared bicycle dispatching frame;
in the step S1, the scheduling area of the shared bicycle is divided as follows: the scheduling area is divided into a plurality of identical regular hexagons serving as scheduling area units, and for each scheduling area unit a global tag variable η5, a horizontal direction tag variable m and a vertical direction tag variable h are defined, which satisfy a one-to-one indexing relation between η5 and the pair (m, h),
wherein η5 ∈ M', M' = {0, 1, ..., ((M+1)² − 1)}, M represents the maximum value of the horizontal direction tag variable or the vertical direction tag variable of the scheduling area units, and M' represents the unit tag set of the scheduling area units;
In the step S1, the running environment variables of the shared bicycle comprise a time variable and a city fixed warehouse position set variable;
the time variable comprises a time step variable t, a time step variable set T and a time step maximum value variable T_max, wherein t ∈ T, T = {0, 1, ..., T_max};
the city fixed warehouse position set variable comprises the fixed warehouse position set ηw;
In the step S2, the scheduling variables of the shared bicycle include a policy execution state variable class, a supply and demand environment variable class, a riding trip variable class and a scheduling policy variable class;
the policy execution state variable class comprises the policy execution state variable tr, wherein tr ∈ {0, 1};
at time step t, the supply and demand environment variable class comprises the shared bicycle travel demand variable of each scheduling area unit, the shared bicycle supply variable of each scheduling area unit when the policy execution state variable tr = 0, and the shared bicycle supply variable of each scheduling area unit when the policy execution state variable tr = 1;
at time step t, the riding trip variable class comprises the global tag η2 of the scheduling area unit where the OD origin of a shared bicycle trip is located, the global tag η3 of the scheduling area unit where the OD destination of a shared bicycle trip is located, the OD tag variable (η2, η3) of shared bicycle travel, the shared bicycle travel OD flow, the trip flow ratio of shared bicycles departing from η2 and arriving at η3, the actual shared bicycle trip production variable of unit η5, and the actual shared bicycle trip attraction variable of unit η5;
At time step t, the scheduling policy variable class comprises the dispatching vehicle tag set I, the dispatching vehicle tag variable i, the starting unit tag variable of a dispatching vehicle, the arrival unit tag variable of a dispatching vehicle, the movement direction variable set κ1 of a dispatching vehicle, the scheduling ratio variable set κ2, the movement direction variable of a dispatching vehicle from its starting unit to one of the six adjacent regular hexagons, the scheduling ratio variable of a dispatching vehicle, the scheduling policy of a dispatching vehicle, the maximum cabin capacity of a dispatching vehicle, the variable for the number of shared bicycles that a dispatching vehicle picks up from η_{i,0} and places at its arrival unit, the ratio α_wh of the number of shared bicycles placed by the dispatching vehicle into η_{i,1} to the number of bicycles in its cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to ηw, the estimated cumulative increase or decrease variable of the supply of unit η5 under the expectation that the preceding dispatching vehicles have already applied their scheduling policies before the current dispatching vehicle applies its own, the increased benefit of a dispatching vehicle after it implements the scheduling policy, and the total amount Z_warehouse of shared bicycles stored in the city fixed warehouses at the end of the scheduling cycle;
wherein I = {0, 1, ..., N}, N represents the maximum value of the dispatching vehicle tag variable, i ∈ I, κ1 = {0, 1, ..., 5}, and κ2 = {0, 0.25, 0.5, 0.75};
In the step S3, the vehicle scheduling optimization model of the shared bicycle specifically includes:
Figure QLYQS_23
s.t.
Figure QLYQS_24
Figure QLYQS_25
Figure QLYQS_26
Figure QLYQS_27
Figure QLYQS_28
Figure QLYQS_29
Figure QLYQS_30
Figure QLYQS_31
Figure QLYQS_32
Figure QLYQS_33
/>
Figure QLYQS_34
Figure QLYQS_35
Figure QLYQS_36
Figure QLYQS_37
in the vehicle scheduling optimization model, the benefit increased after the dispatching vehicles implement the scheduling policy is maximized as the objective function of the shared bicycle short-term scheduling optimization problem; the objective accumulates the increased benefit of every dispatching vehicle over all time steps of the scheduling cycle, wherein t represents a time step, T_max the maximum value variable of the time step, i the dispatching vehicle tag variable and N the maximum value of the dispatching vehicle tag variable;
the action decision of a dispatching vehicle when the policy execution state variable tr = 0 at time step variable t consists of its movement direction variable from η_{i,0} to one of the six adjacent regular hexagons and its scheduling ratio variable;
when the policy execution state variable tr = 0 at time step variable t and the global tag variable η5 of a scheduling area unit coincides with the global tag η2 of the unit where the OD origin of a shared bicycle trip is located, the shared bicycle path flow of the OD tag variable (η2, η3) is computed, using the round-down operation INT(·), from the shared bicycle travel demand variable of the unit, its supply (the initially given supply when t = 0, or the supply variable with tr = 1 otherwise) and the trip flow ratio of shared bicycles departing from η2 and arriving at η3, wherein M' represents the unit tag set of the scheduling area units;
the trip flow ratios taking the global tag η2 of the OD-origin unit as the starting point sum to 1 over all destination units, wherein T represents the time step variable set and η3 represents the global tag of the unit where the OD destination of the shared bicycle trip is located;
when the policy execution state variable tr = 0 at time step t and the global tag variable η5 of a scheduling area unit equals the global tag η2 of the OD-origin unit, the sum of the corresponding shared bicycle path flows is taken as the actual shared bicycle trip production of unit η5; when η5 equals the global tag η3 of the OD-destination unit, the sum of the corresponding shared bicycle path flows is taken as the actual shared bicycle trip attraction of unit η5;
when the policy execution state variable tr = 0 at time step variable t, the shared bicycle supply is updated according to the numbers of shared bicycles rented and parked in the riders' travel activities: the supply of unit η5 equals its supply after the scheduling policy was applied at time step (t−1) (i.e. the supply variable with tr = 1), minus the actual shared bicycle trip production of η5 at time step t, plus the actual shared bicycle trip attraction of η5 at time step t;
when the policy execution state variable tr = 0 at time step variable t, the unit tag variable that a dispatching vehicle will reach at time step (t+1) is computed from the horizontal direction tag variable m and the vertical direction tag variable h of its current unit together with its movement direction variable from η_{i,0} to one of the six adjacent regular hexagons, and the starting unit tag variable of the dispatching vehicle at time step (t+1) is obtained accordingly;
when the policy execution state variable tr = 0 at time step variable t, the estimated cumulative increase or decrease of the supply of unit η5 accumulates the numbers of shared bicycles that the preceding dispatching vehicles are predicted to pick up from or place into η5, wherein α_wh denotes the ratio of the number of shared bicycles placed by a dispatching vehicle into η_{i,1} to the number of bicycles in its cabin when the dispatching vehicle arrives at η_{i,1} and η_{i,1} belongs to the fixed warehouse position set ηw;
when the policy execution state variable tr = 0 at time step t, the dispatching vehicle picks up from η_{i,0} a number of shared bicycles determined by taking the minimum (min(·)) of the quantity given by its scheduling ratio variable applied to the supply of η_{i,0} when tr = 0 and the remaining space of its cabin, bounded by the maximum cabin capacity of the dispatching vehicle, places the picked-up bicycles into its cabin, and later places them entirely into η_{i,1}, wherein η_{i,0} represents the starting unit tag variable of the dispatching vehicle;
when the policy execution state variable tr = 1 at time step t, the scheduling policy is executed according to the number of shared bicycles picked up by each dispatching vehicle, and the shared bicycle supply variable of unit η5 after the scheduling policy has been implemented is updated accordingly;
the total amount Z_warehouse of shared bicycles stored in the city fixed warehouses is obtained by accumulating, over the scheduling cycle, the shared bicycles that the dispatching vehicles place into units belonging to ηw and that are stored into the warehouses;
Said step S4 comprises the sub-steps of:
s41: determining elements of a shared bicycle scheduling framework based on a vehicle scheduling optimization model of the shared bicycle;
s42: determining average actions by using a one-hot coding mode;
s43: defining experience pool variables and training round related variables of a shared bicycle scheduling framework;
s44: based on the average field theory, the shared bicycle scheduling frame is constructed according to the elements, the average actions, the experience pool variables and the training round related variables of the shared bicycle scheduling frame.
2. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S41 the elements of the shared bicycle scheduling framework include the state of the dispatching vehicle, the behavior parameter a_t and the reward functions, wherein the state describes dispatching vehicle i at the time step variable t and the behavior parameter gives the scheduling policy of the dispatching vehicle at the time step variable t;
the reward functions include the actually increased trip amount reward function of the dispatching vehicle, the average increased trip amount reward function of the dispatching vehicle and the globally increased trip amount reward function of the dispatching vehicle, wherein α_rw represents the scaling coefficient of the reward function; each reward compares the trip amount generated by a scheduling region unit when the scheduling policy is implemented with the trip amount generated when it is not implemented, the average reward further accounts for the number of dispatching vehicles within the corresponding scheduling region unit at the time step variable t, and the global reward aggregates the trip amounts of η_5 with and without the scheduling policy over all units; N represents the maximum value of the dispatching vehicle tag variable, η_5 represents the global tag of each scheduling region unit, and M' represents the set of unit tags of the scheduling region units.
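A hedged Python sketch of the three reward signals described above. It assumes each reward is the α_rw-scaled difference between the trip amount generated with and without the scheduling policy, with the average reward shared among the vehicles in the same unit and the global reward aggregated over all units in M'; the exact formulas are not reproduced in the source text, so the function and argument names are illustrative.

```python
def actual_reward(alpha_rw, trips_with_policy, trips_without_policy):
    """Actually increased trip amount for the unit served by dispatching vehicle i."""
    return alpha_rw * (trips_with_policy - trips_without_policy)

def average_reward(alpha_rw, trips_with_policy, trips_without_policy, n_vehicles_in_unit):
    """Average increase: the actual increase shared among the dispatching
    vehicles located in the same scheduling region unit at time step t."""
    return actual_reward(alpha_rw, trips_with_policy, trips_without_policy) / max(n_vehicles_in_unit, 1)

def global_reward(alpha_rw, trips_with, trips_without, n_vehicles_total):
    """Global increase: summed over every scheduling region unit in M' and
    normalised by the total number of dispatching vehicles."""
    total = sum(alpha_rw * (w - wo) for w, wo in zip(trips_with, trips_without))
    return total / max(n_vehicles_total, 1)
```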
3. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S42 the specific method for determining the average action is: rewriting the scheduling policy of the dispatching vehicle in a one-hot coding mode, i.e. as a vector of 0/1 variables whose dimension is the scheduling policy dimension ρ_dim, and obtaining the average action of dispatching vehicle i by averaging the one-hot coded action policies of the dispatching vehicles i_ne, wherein N represents the maximum value of the dispatching vehicle tag variable, and i_ne represents the tag variable of a dispatching vehicle different from dispatching vehicle i.
4. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein the experience pool variables of the shared bicycle scheduling framework in step S43 include the experience pool and the experience pool capacity; the training round related variables include the training round number Episode, the updating training round number Episode_upnet, the weight coefficient ω updated by the target network and the accumulated return discount factor γ.
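A small Python sketch of the experience pool with a fixed capacity. The stored tuple follows the policy gradient branch of claim 6 (global state, scheduling policies, average actions, rewards, next global state); the class and method names are illustrative.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity replay buffer; the oldest transitions are discarded first."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, policies, mean_actions, rewards, next_state):
        self.buffer.append((state, policies, mean_actions, rewards, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```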
5. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein the step S44 comprises the following sub-steps:
S441: initializing the experience pool, setting the experience pool capacity, the weight coefficient ω updated by the target network, the reward function scaling coefficient α_rw, the accumulated return discount factor γ, the initially given supply amount, the shared bicycle travel demand variable and the trip flow ratio of shared bicycles departing from η_2 and arriving at η_3, and performing steps S442-S445 in a loop based on the training round number Episode;
S442: updating the shared bicycle operating environment when the policy execution state variable tr = 0;
S443: updating the state and the scheduling policy of each dispatching vehicle;
S444: updating the shared bicycle operating environment when the policy execution state variable tr = 1;
S445: updating the next-time-step state of each dispatching vehicle and the average action, and updating the increased benefit of each dispatching vehicle after implementing the scheduling policy according to the reward functions;
S446: based on the updating process of steps S442-S445, constructing the shared bicycle scheduling framework by using a reinforcement learning algorithm, and completing the shared bicycle scheduling by using the shared bicycle scheduling framework.
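A control-flow skeleton of the S441-S446 loop in Python. The env, agents and pool objects stand in for the environment update, the dispatching vehicle agents and the experience pool of claims 4-6; this is a sketch of the loop structure implied by the sub-steps, not the claimed implementation.

```python
def train(env, agents, pool, episodes, episode_upnet, gamma, omega):
    """Outline of steps S441-S446 under the stated placeholder assumptions."""
    for episode in range(episodes):
        env.reset()                                                   # S441: supply, demand, flow ratios
        for t in range(env.horizon):
            env.update(tr=0)                                          # S442
            states = [ag.observe(env) for ag in agents]
            policies = [ag.act(s) for ag, s in zip(agents, states)]   # S443
            env.apply(policies)                                       # execute the scheduling policies
            env.update(tr=1)                                          # S444
            next_states = [ag.observe(env) for ag in agents]
            mean_acts = [ag.mean_action(policies) for ag in agents]   # S445
            rewards = env.rewards(policies)
            pool.store(states, policies, mean_acts, rewards, next_states)
        for ag in agents:                                             # S446: RL update
            ag.learn(pool, gamma)
        if episode % episode_upnet == 0:
            for ag in agents:
                ag.sync_targets(omega)                                # weighted transfer to target networks
```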
6. The shared bicycle scheduling method based on deep reinforcement learning according to claim 1, wherein in step S442 the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 0 is: updating and calculating the shared bicycle path flow of the travel OD tag variable (η_2, η_3), the actual trip generation variable of the shared bicycles of η_5, the actual trip attraction variable of the shared bicycles of η_5 and the supply amount when the policy execution state variable tr = 0; in step S444, the specific method for updating the shared bicycle operating environment when the policy execution state variable tr = 1 is: updating and calculating, from the behavior parameter a_t, the variable for the number of shared bicycles that the dispatching vehicle picks up from η_{i,0} and puts down upon arriving at η_{i,1}, and the supply amount when the policy execution state variable tr = 1;
in step S446, the reinforcement learning algorithm adopts a policy gradient method or a Q-Learning method;
each dispatching vehicle comprises a policy model and a value model; in the policy gradient method, the policy model of each dispatching vehicle comprises a policy estimation network and a policy target network; the policy estimation network is constructed as a neural network whose parameters are θ_i, whose input is the state of the dispatching vehicle and whose output is the scheduling policy; the policy target network is constructed as a neural network whose input is the state of the dispatching vehicle at the next time step variable and whose output is the scheduling policy of the next time step variable;
in the policy gradient method and the Q-Learning method, the value model of each dispatching vehicle comprises a value estimation network and a value target network; the value estimation network is constructed as a neural network whose inputs are the state of the dispatching vehicle, the scheduling policy and the average action and whose output is the Q value function Q_i, wherein the Q value function refers to the state-action value function in the reinforcement learning algorithm and represents the accumulated reward value attained by the dispatching vehicle; the value target network is constructed as a neural network whose inputs are the state of the dispatching vehicle at the next time step variable, the scheduling policy of the next time step variable and the average action of the next time step variable and whose output is the target Q value function;
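A hedged PyTorch sketch of the two estimation networks described above: a policy network mapping the vehicle state to a scheduling policy, and a value network mapping (state, policy, average action) to Q_i. Layer sizes and activations are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy estimation network (parameters theta_i): state -> scheduling policy."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, rho_dim), nn.Softmax(dim=-1))

    def forward(self, state):
        return self.net(state)

class ValueNet(nn.Module):
    """Value estimation network: (state, policy, average action) -> Q_i."""
    def __init__(self, state_dim: int, rho_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * rho_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, policy, mean_action):
        return self.net(torch.cat([state, policy, mean_action], dim=-1))
```

The corresponding target networks would be identically shaped copies, for example initialised with policy_target.load_state_dict(policy_net.state_dict()).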
in the Q-Learning method, in the policy model of the dispatching vehicle, an action is obtained by probability sampling and selection according to an action probability calculation function derived from the functional expression form of Q_i, wherein the average action of the other dispatching vehicles i_ne at time step t-1 is used as input, i_ne represents the tag variable of a dispatching vehicle different from dispatching vehicle i, ω_d represents the policy parameter, and A_i represents the action space set of dispatching vehicle i; the average action is then updated according to the sampled actions and substituted back into the action probability calculation function, probability sampling is performed again, and the action obtained in this second sampling is taken as the final choice of the policy model of the dispatching vehicle;
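A numpy sketch of this two-pass selection: provisional sampling with the t-1 average action, a refresh of each vehicle's average action from the provisional policies of the others, then resampling for the final choice. The Boltzmann (softmax) form of the action probability function, with ω_d acting as a temperature-like policy parameter, is an assumption; the claimed formulas are not reproduced in the source text.

```python
import numpy as np

def boltzmann_probs(q_values: np.ndarray, omega_d: float) -> np.ndarray:
    """Assumed action probability calculation function: softmax of Q over A_i."""
    z = q_values / omega_d
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_policy(q_fn, state, mean_act, actions, omega_d, rng):
    """One probability-sampling pass over the action space A_i."""
    q = np.array([q_fn(state, a, mean_act) for a in actions])
    return int(rng.choice(actions, p=boltzmann_probs(q, omega_d)))

def select_actions(q_fns, states, prev_means, actions, omega_d, rho_dim, rng):
    """Two-pass mean-field selection: provisional sampling, average-action refresh, resampling."""
    provisional = [sample_policy(q, s, prev_means[i], actions, omega_d, rng)
                   for i, (q, s) in enumerate(zip(q_fns, states))]
    final = []
    for i, (q, s) in enumerate(zip(q_fns, states)):
        mean_i = np.mean([np.eye(rho_dim)[a] for k, a in enumerate(provisional) if k != i], axis=0)
        final.append(sample_policy(q, s, mean_i, actions, omega_d, rng))
    return final
```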
if the reinforcement learning algorithm adopts the policy gradient method, the tuple consisting of the global state s_t, the scheduling policies, the average actions, the reward values r_t^i of the dispatching vehicles and the global state s_{t+1} of the next time step is stored in the experience pool, a batch of samples is randomly sampled from the experience pool, the neural network parameters of the value estimation network are updated according to the sampled samples, the accumulated return discount factor γ and a loss function, and the neural network parameters θ_i of the policy model are updated by the gradient descent method; every Episode_upnet training rounds, the parameters θ_i of the policy model neural network and the parameters of the value model neural network are transferred, according to the weight coefficient ω updated by the target network, to the neural network parameters of the corresponding policy target network and the neural network parameters of the value target network respectively; wherein s_t represents the global state, s_{t+1} represents the global state of the next time step, r_t^i represents the reward value of the dispatching vehicle, s_{t,j} represents the global state in a sampled sample, s_{t+1,j} represents the global state of the next time step in the sampled sample, and the sampled sample likewise contains the scheduling policies, the average actions and the reward values of the dispatching vehicles;
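A hedged PyTorch sketch of one policy-gradient update for a single dispatching vehicle: a TD target built from the value target network and γ, a mean squared error loss for the value estimation network, a gradient-descent step on θ_i via the negative Q objective, and the ω-weighted transfer to the target networks every Episode_upnet rounds. The soft-update form target = ω * online + (1 - ω) * target is one assumed reading of the weight coefficient ω; the batch layout and names are illustrative.

```python
import torch

def update_agent(policy, policy_tgt, value, value_tgt, batch, gamma, pi_opt, q_opt):
    """One update from a sampled batch (s, rho, mean_rho, r, s_next, mean_rho_next)."""
    s, rho, mean_rho, r, s_next, mean_rho_next = batch
    with torch.no_grad():
        rho_next = policy_tgt(s_next)
        y = r + gamma * value_tgt(s_next, rho_next, mean_rho_next)   # TD target
    q_loss = torch.mean((value(s, rho, mean_rho) - y) ** 2)          # critic loss
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    pi_loss = -value(s, policy(s), mean_rho).mean()                  # actor objective (gradient descent on theta_i)
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()

def sync_targets(online, target, omega):
    """Every Episode_upnet rounds: omega-weighted transfer of parameters to the target network."""
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.copy_(omega * p + (1.0 - omega) * p_t)
```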
if the reinforcement learning algorithm adopts the Q-Learning method, the corresponding tuple is stored in the experience pool, a batch of samples is randomly sampled from the experience pool, the neural network parameters of the value estimation network are updated according to the sampled samples, the accumulated return discount factor γ and a loss function, and every Episode_upnet training rounds the neural network parameters of the value model are transferred, according to the weight coefficient ω updated by the target network, to the neural network parameters of the value target network.
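For the Q-Learning branch, a matching PyTorch sketch of the value-only update: the target uses the value target network, γ and the next-step average action, and the Episode_upnet-interval transfer can reuse the ω-weighted sync_targets above. Taking the maximum over enumerated candidate next actions is an assumption in the spirit of Q-Learning; cand_next and the other names are illustrative.

```python
import torch

def update_q(value, value_tgt, batch, gamma, q_opt):
    """Mean-field Q update: regress Q_i(s, rho, mean_rho) onto r + gamma * max Q_target."""
    s, rho, mean_rho, r, s_next, cand_next, mean_rho_next = batch
    with torch.no_grad():
        # cand_next enumerates one-hot candidate scheduling policies for the next step
        q_next = torch.stack([value_tgt(s_next, a, mean_rho_next) for a in cand_next], dim=0)
        y = r + gamma * q_next.max(dim=0).values
    loss = torch.mean((value(s, rho, mean_rho) - y) ** 2)
    q_opt.zero_grad()
    loss.backward()
    q_opt.step()
    return loss.item()
```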
CN202110744265.2A 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning Active CN113326993B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110421814 2021-04-20
CN2021104218142 2021-04-20

Publications (2)

Publication Number Publication Date
CN113326993A CN113326993A (en) 2021-08-31
CN113326993B true CN113326993B (en) 2023-06-09

Family

ID=77425362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110744265.2A Active CN113326993B (en) 2021-04-20 2021-06-30 Shared bicycle scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113326993B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113997926A (en) * 2021-11-30 2022-02-01 江苏浩峰汽车附件有限公司 Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
CN115796399B (en) * 2023-02-06 2023-04-25 佰聆数据股份有限公司 Intelligent scheduling method, device, equipment and storage medium based on electric power supplies
CN116307251B (en) * 2023-04-12 2023-09-19 哈尔滨理工大学 Work schedule optimization method based on reinforcement learning
CN116402323B (en) * 2023-06-09 2023-09-01 华东交通大学 Taxi scheduling method
CN116824861B (en) * 2023-08-24 2023-12-05 北京亦庄智能城市研究院集团有限公司 Method and system for scheduling sharing bicycle based on multidimensional data of urban brain platform


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582469A (en) * 2020-03-23 2020-08-25 成都信息工程大学 Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN112068515A (en) * 2020-08-27 2020-12-11 宁波工程学院 Full-automatic parking lot scheduling method based on deep reinforcement learning
CN112417753A (en) * 2020-11-04 2021-02-26 中国科学技术大学 Urban public transport resource joint scheduling method
CN112348258A (en) * 2020-11-09 2021-02-09 合肥工业大学 Shared bicycle predictive scheduling method based on deep Q network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Deep Learning Model for Traffic Flow State Classification Based on Smart Phone Sensor Data; Tu Wenwen et al.; arXiv preprint arXiv:1709.08802 *
Deep Reinforcement Learning with Double Q-Learning; van Hasselt et al.; 30th AAAI Conference on Artificial Intelligence; 2094-2100 *
A Survey on Multi-Agent Reinforcement Learning Methods for Vehicular Networks; Ibrahim Althamary et al.; 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC); 2019, 1154-1159 *
Mean field theory and closed queueing network study of bike-sharing systems (共享单车系统的平均场理论与闭排队网络研究); Fan Ruina; 《中国博士学位论文全文数据库 经济与管理科学辑》, No. 1; J151-1 *
Research on shared bicycle dispatching route optimization (共享单车调度路径优化研究); Chen Jiahui et al.; 《交通科技与经济》, Vol. 23, No. 2; 13-20 *


Similar Documents

Publication Publication Date Title
CN113326993B (en) Shared bicycle scheduling method based on deep reinforcement learning
CN111862579B (en) Taxi scheduling method and system based on deep reinforcement learning
Liu et al. Distributed and energy-efficient mobile crowdsensing with charging stations by deep reinforcement learning
CN107831685B (en) Group robot control method and system
CN110659796A (en) Data acquisition method in rechargeable group vehicle intelligence
CN108492568A (en) A kind of Short-time Traffic Flow Forecasting Methods based on space-time characterisation analysis
US20220041076A1 (en) Systems and methods for adaptive optimization for electric vehicle fleet charging
CN112417753A (en) Urban public transport resource joint scheduling method
CN117541026B (en) Intelligent logistics transport vehicle dispatching method and system
CN111352713B (en) Automatic driving reasoning task workflow scheduling method oriented to time delay optimization
Shi et al. Deep q-network based route scheduling for transportation network company vehicles
CN113592162A (en) Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN117350424A (en) Economic dispatching and electric vehicle charging strategy combined optimization method in energy internet
CN116227773A (en) Distribution path optimization method based on ant colony algorithm
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
CN116614394A (en) Service function chain placement method based on multi-target deep reinforcement learning
CN113673836B (en) Reinforced learning-based shared bus line-attaching scheduling method
CN117539929A (en) Lamp post multi-source heterogeneous data storage device and method based on cloud network edge cooperation
CN112750298A (en) Truck formation dynamic resource allocation method based on SMDP and DRL
CN117068393A (en) Star group collaborative task planning method based on mixed expert experience playback
CN117420824A (en) Path planning method based on intelligent ant colony algorithm with learning capability
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN116128028A (en) Efficient deep reinforcement learning algorithm for continuous decision space combination optimization
CN115187056A (en) Multi-agent cooperative resource allocation method considering fairness principle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant