CN111461500B - Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning
- Publication number: CN111461500B (application CN202010172819.1A)
- Authority
- CN
- China
- Prior art keywords
- electronic fence
- determining
- neural network
- action
- dqn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0645—Rental transactions; Leasing transactions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Abstract
The invention provides a tidal phenomenon control method for a shared bicycle system based on a dynamic electronic fence and reinforcement learning, addressing the supply-demand imbalance of shared bicycle systems. The method comprises the following steps: (1) determining the state information of the electronic fence group; (2) determining the scheduling action of the electronic fences; (3) determining the behaviors and interactions of the agents; (4) determining the current benefit obtained by taking action a; (5) determining the reinforcement learning environment of the electronic fence scheduling system; (6) determining the agent state transition rule based on the DQN neural network; (7) constructing a DQN neural network and carrying out forward calculation; (8) selecting each output action using a random exploration strategy; (9) training the DQN neural network model and updating its parameters; (10) judging whether training of the DQN neural network is finished; (11) inputting the initial time and the initial state of the electronic fence group into the trained neural network to obtain the control strategy of the electronic fences.
Description
(I) technical field
The invention provides a tidal phenomenon control method for a shared bicycle system based on dynamic electronic fences and reinforcement learning, aimed at the supply-demand imbalance of urban shared bicycle systems. Under the premise of a limited number of bicycles and dynamic electronic fences, the electronic fences are scaled through a reinforcement learning method, inducing pedestrians to park bicycles at reasonable positions before the tide phenomenon arrives. With the goals of improving customer satisfaction and bicycle utilization, the method establishes a Deep Q Network (DQN) based reinforcement learning model and performs optimization control and decision-making on the tide phenomenon of shared bicycles, so as to relieve or eliminate the tide phenomenon in the shared bicycle system. The method belongs to the field of intelligent transportation.
(II) background of the invention
To relieve urban traffic problems, an important means widely accepted at home and abroad is to implement new green travel modes and construct a low-carbon, environmentally friendly traffic system. In China, the appearance of the "pile-free" (dockless) shared bicycle provides a brand-new solution to these problems and is an important innovation on urban public bicycles, but the development of shared bicycles currently has some problems to be solved urgently: (1) systems do not accurately evaluate bicycle demand during construction; blind, excessive deployment of bicycles causes waste, occupies excessive public resources, and increases the operating costs of enterprises; (2) during system construction, the asymmetric demand of the shared bicycle system is not fully considered, producing a tidal phenomenon: especially at peak times, riders in some areas cannot find a bicycle while bicycles elsewhere sit idle with no one to ride them.
For these problems, system constructors and supervisors have taken various measures, such as reducing the influence of the tide phenomenon through regular manual vehicle dispatching, or avoiding disordered parking by setting up electronic fences. However, the influence of these measures on the comprehensive benefits of the system, such as bicycle utilization rate and user satisfaction, has lacked effective systematic research. For this need, the invention is based on a dynamic electronic fence and a reinforcement learning method; with the goals of improving customer satisfaction and bicycle utilization, it constructs a shared bicycle system scheduling model that considers both indexes, gives a multi-objective system optimization algorithm, and provides a new solution for the supply-demand imbalance of shared bicycle systems and for improving their comprehensive benefit.
(III) Disclosure of the invention
(1) Objects of the invention
The invention provides a tide control scheme for an electronic fence based shared bicycle system with reinforcement learning at its core. By standardizing and inducing pedestrian parking positions, the positions of bicycles are automatically distributed before the flow peak arrives, solving the supply-demand imbalance caused by the tide phenomenon. By controlling the tide phenomenon in the shared bicycle system, the bicycle utilization rate and customer satisfaction can be improved with the same number of bicycles and electronic fences.
(2) Technical scheme
The invention relates to a tidal control method for a shared bicycle system based on dynamic electronic fences and reinforcement learning. The method first analyzes the attributes and parameter systems of the two agents, bicycle and pedestrian, and defines the bicycle scheduling evaluation indexes (bicycle utilization rate, pedestrian satisfaction, and their relation models). Then, by analyzing the agent types, interaction modes, and so on, an evaluation process for bicycle utilization rate and pedestrian satisfaction is determined, forming an agent-based simulation modeling method. Next, the invention determines the goals of electronic fence scheduling and the algorithmic environment, analyzes how DQN is applied to the bicycle tide control problem (including the details of the reinforcement learning algorithm and its complete process), and determines the overall process of electronic fence based bicycle tide control. Finally, the proposed reinforcement learning algorithm and control strategy are verified through simulation experiments: the bicycle utilization rate and pedestrian satisfaction before and after the control strategy is implemented are analyzed, and the feasibility and effectiveness of the method are evaluated and verified.
The method comprises the following steps:
step one, determining the state information of the electronic fence group.
And step two, determining the size scaling of the electronic fence as a scheduling action.
And step three, determining the behaviors and the interaction of the intelligent agent.
Step four, determining the current benefit obtained by taking action a.
Step five, determining a reward function Q(s_t, a) to evaluate how good it is for the agent to take action a in a particular state s_t.
And step six, determining an intelligent agent state transfer rule based on the DQN neural network, thereby automatically updating the state of the intelligent agent in the reinforcement learning process and continuously interacting with the intelligent agent environment to form a closed loop.
And seventhly, constructing a DQN neural network and performing forward calculation. The method is divided into the following substeps:
(1) determining input information for DQN neural networks
(2) Determining output information for DQN neural networks
(3) Determining DQN neural network structure
And step eight, selecting each output action by utilizing a random exploration strategy.
And step nine, training the DQN neural network model and updating parameters.
And step ten, judging whether the training of the DQN neural network is finished.
And step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
Through the steps, the optimal control strategy of the shared bicycle system before the tide phenomenon comes can be obtained.
(IV) description of the drawings
FIG. 1 is an overall architecture of the present invention
FIG. 2 is a flow of agent behavior interaction between a pedestrian, an electronic fence, and a bicycle
FIG. 3 is a satisfaction degree calculation flow
FIG. 4 is a diagram of a neural network architecture
(V) detailed description of the preferred embodiments
The invention provides a tidal control method for a shared bicycle system based on reinforcement learning. Deep Q Network (DQN), as a typical reinforcement learning algorithm, avoids having to establish a complex and precise mathematical model when solving this intelligent optimization problem. Before the tide phenomenon of shared bicycles occurs, the sizes of the electronic fences in the system are effectively scheduled so that bicycles are induced toward fence areas with high demand, improving shared bicycle utilization and pedestrian satisfaction. To make the technical solution, features and advantages of the invention more clearly understood, the following detailed description is made with reference to the accompanying drawings. The overall architecture of the invention is shown in fig. 1, and the specific implementation steps are as follows:
step one, determining the state information of the electronic fence group.
State s = (s_1, s_2, …, s_i, t): the state mainly includes the electronic fence group state information s_i, i.e., the cumulative number of parked bicycles that each electronic fence involved in the dispatch gains (or loses) over the whole dispatch period. In addition, since the instant reward generated by executing an action at a state transition is time-dependent, the state s should also include the current time t. Here the state s of the system is selected as the combination of these two parts, and the state s can decide whether the reinforcement learning is finished. The cluster controller obtains the state information at each discrete time point as the decision basis; the related model calculation process and states can be obtained through simulation software.
And step two, determining the size scaling of the electronic fence as a scheduling action.
Action a = (a_1, a_2, …, a_i): the scaling of the electronic fences is taken as the scheduling action, i.e., enlarging (or reducing) the parking range of each electronic fence at a specific time. Here a_1 + a_2 + a_3 + … + a_i = 0: since a_1 is the number of bicycles induced to a given fence per unit time period, these bicycles must come from the other fences, so a_1 = -(a_2 + a_3 + … + a_i). This action formulation is easy to understand and convenient to compute. The available action set A depends on the current state of the system.
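The zero-sum constraint on actions described above can be sketched in code. This is a minimal illustrative sketch; the helper name and the even split across the other fences are assumptions, not part of the patent:

```python
def make_action(induced_to_target, n_other_fences):
    """Build a zero-sum scheduling action: the target fence gains
    induced_to_target bicycles, drawn evenly from the other fences,
    so that a1 + a2 + ... + ai = 0, i.e. a1 = -(a2 + ... + ai)."""
    per_fence = induced_to_target / n_other_fences
    return [induced_to_target] + [-per_fence] * n_other_fences

# Example action (+10, -5, -5): one fence gains 10 bicycles,
# two other fences each contribute 5.
action = make_action(10, 2)
```

Any other split among the contributing fences would satisfy the same constraint; the even split is just the simplest choice.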
And step three, determining the behaviors and the interaction of the intelligent agent.
Agent interaction rules are the rules that agent movement must follow; the interaction of pedestrians, bicycles and the environment is mainly reflected in the pedestrian's travel mode. In the shared bicycle system, pedestrian travel falls into two categories. The first is unrelated to the dispatching process: if a pedestrian does not accept the distance to the nearest bicycle, walking is chosen. The second relates to dispatching: a pedestrian who accepts the distance to a bicycle may still be affected by scheduling, depending on whether the current time is within the scheduling period, whether the pedestrian's destination is in the set of scheduled electronic fences, whether the target electronic fence's demand has already been met, and whether the pedestrian accepts the scheduling and continues to choose to ride. The sizes of the electronic fences are scheduled in a period before the tide phenomenon occurs; the behavior interaction flow of pedestrians, electronic fences and bicycles is shown in fig. 2.
Step four, determining the current benefit obtained by taking action a.
The instant reward r(s_t, a) is the current benefit obtained by taking action a in state s_t; these values together constitute an r(s_t, a) matrix. The dispatching aim is that, in a specific time period before the tide phenomenon arrives, the number of bicycles in certain areas reaches the required quantity, so that the bicycle riding rate, i.e., the average pedestrian satisfaction, becomes as high as possible. The average pedestrian satisfaction per unit time period during dispatching can be calculated through simulation and used as r(s_t, a). Calculating the average pedestrian satisfaction mainly requires counting the average daily number of rides: whenever a pedestrian agent's state is "using a bicycle", the ride counter is incremented (num_cycling++), so num_cycling represents the total number of rides. Fig. 3 shows the situations in which a pedestrian may ride a bicycle; there, the calculation of average pedestrian satisfaction mainly requires counting the number of times bicycles are used. According to the above analysis, riding first requires determining whether a pedestrian with a riding demand can reach a bicycle, and the four possibilities are associated with one another.
Step five, determining a reward function Q(s_t, a) to evaluate how good it is for the agent to take action a in a particular state s_t.
The reward function Q(s_t, a) evaluates how good it is for the agent to take action a in state s_t, i.e., it is the action-utility function. Q(s_t, a) is the expected value of the discounted sum of instant rewards r(s_t, a) over a series of actions, i.e., Q(s_t, a) = E[Σ_i γ^i r_i(s_t, a)]. Solving for Q(s_t, a) with the reinforcement learning algorithm yields a scheduling scheme, since Q(s_t, a) instructs the agent to take the most favourable action, so that the average pedestrian satisfaction and average bicycle utilization rate finally reach their maximum values.
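The discounted sum inside this expectation can be computed directly; a minimal sketch, where the default γ = 0.9 is an assumed example value:

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum over i of gamma^i * r_i,
    the quantity whose expectation defines Q(s_t, a)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```

For example, with γ = 0.5 the reward sequence [1, 1, 1] yields 1 + 0.5 + 0.25 = 1.75.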
Assume three electronic fences A, B and C participate in scheduling and the initial state is (0, 0, 0, 0). Through the action (+10, -5, -5), i.e., A induces 10 bicycles while B and C each give up 5, the state of the electronic fence group becomes (+10, -5, -5, 1). The average pedestrian satisfaction in this stage is taken as the reward obtained by the agent's action, e.g., r = 0.356; the instant rewards corresponding to the actions taken by the electronic fences in different states can be simulated through AnyLogic.
And step six, determining an intelligent agent state transfer rule based on the DQN neural network, thereby automatically updating the state of the intelligent agent in the reinforcement learning process and continuously interacting with the intelligent agent environment to form a closed loop.
The state transition rule of the agent is determined in the DQN neural network, so that the agent's state can be automatically updated in the reinforcement learning process and the agent continuously interacts with its environment, forming a closed loop. Assume the agent state is (in_num, out_num_1, out_num_2, t), where in_num is the cumulative number of bicycles induced from the target fences into electronic fence A near the teaching building by time t, and out_num_1 and out_num_2 are the cumulative numbers of bicycles that, from the start of scheduling to time t, should have been parked within the acceptance range of stadium electronic fences B and C respectively, with A = B + C. There are four scheduling actions, the set (a, b, c, d), represented respectively as (+10, -5, -5), (+20, -10, -10), (+30, -15, -15) and (+40, -20, -20); the unit in each tuple is bicycles.
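A hypothetical sketch of this state-transition rule; the function name and tuple layout are assumptions for illustration:

```python
def step(state, action):
    """Apply one scheduling action (da, db, dc) to the agent state
    (in_num, out_num1, out_num2, t) and advance time by one step."""
    in_num, out1, out2, t = state
    da, db, dc = action                      # e.g. (+10, -5, -5)
    new_state = (in_num + da, out1 - db, out2 - dc, t + 1)
    # invariant from the text: A = B + C
    assert new_state[0] == new_state[1] + new_state[2]
    return new_state
```

Starting from (0, 0, 0, 0), the action (+10, -5, -5) produces the state (10, 5, 5, 1): fence A has gained 10 bicycles, 5 drawn from each of B and C.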
And seventhly, constructing a DQN neural network and performing forward calculation.
A typical neuron consists of five parts: inputs, weights and biases (thresholds), a summation unit, an excitation (activation) function, and an output. The neural network structure used to store the value function is shown in FIG. 4.
(1) Determining input information for DQN neural networks
The input layer is (s_1, s_2, …, s_j, t), where t denotes the current time, s_1 represents the cumulative number of dispatched bicycles of the target electronic fence at that moment, and s_2, …, s_j represent the cumulative numbers of dispatched bicycles of each of the other scheduled electronic fences.
(2) Determining output information for DQN neural networks
The output layer is (a_1, a_2, …, a_n); its dimension n represents a total of n scheduling actions, where a_i represents the ith scheduling action, e.g., (+10, -5, -5).
(3) Determining structure of DQN neural network
The neural network has two hidden layers, i.e., its depth is 2. The dimension of the input layer is j + 1 and that of the output layer is n; after the output layer selects the action corresponding to a Q value according to the ε-greedy policy, the action a_k at time t + 1 is determined. A neural network is a complex network formed by a large number of simple neurons connected to one another; the summation unit performs a weighted sum of the input signals, and the result is passed through the excitation function to give the neuron's output. The output of the jth neuron is:
a_j^l = σ(Σ_k w_jk^l · a_k^(l-1) + b_j^l)

where w_jk^l denotes the weight connecting the kth neuron of layer l-1 to the jth neuron of layer l (the input layer is layer 0; here l = 2), a_k^(l-1) is the output of the kth neuron of the previous layer, and b_j^l is the bias of the jth neuron of layer l. σ is the excitation function: since a purely linear neural network lacks expressive power, a nonlinear factor is added through the excitation function to handle nonlinear problems. Considering differentiability and monotonicity, typical forms of the excitation function are tanh, sigmoid and ReLU; ReLU converges quickly, is simple to compute, and is hard to saturate, so here ReLU is used instead of sigmoid:

f(x) = max(0, x)

The weights w_jk^l and biases b_j^l are tunable; they reflect the behavioral characteristics of the neural network.
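The per-neuron forward computation described above can be written out directly; a minimal sketch in plain Python:

```python
def relu(x):
    """Excitation function f(x) = max(0, x)."""
    return max(0.0, x)

def neuron_output(weights, inputs, bias):
    """Weighted sum of the previous layer's outputs plus the bias,
    passed through the ReLU excitation function."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return relu(z)
```

A full layer is just this computation repeated for each neuron, with its own weight vector and bias.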
And step eight, selecting each output action by utilizing a random exploration strategy.
In the dispatching process of the electronic fence bicycle system, randomness is everywhere: the origin and destination of a riding pedestrian are uncertain, as is whether a pedestrian will ride at all. Therefore an ε-greedy action selection strategy is used in the DQN neural network; this random exploration strategy mitigates the problem that value-function-based algorithms such as DQN sometimes cannot obtain the optimal policy. An ε-greedy action includes the time and the size of each electronic fence at that time. A small ε value is set to prevent the algorithm from falling into a locally optimal solution, so that the agent keeps a certain amount of exploration while searching for the globally optimal solution. After a list of Q values is obtained from the neural network, the action to take is selected according to the ε-greedy strategy: with probability 1 - ε, the next action of the electronic fence system is determined by the maximum Q value output by the value network (one Q value corresponds to one action); with probability ε, the system explores, i.e., an action is selected at random. Actions continue to be selected in this way until, after an action is taken, the next state is reached.
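The ε-greedy rule just described can be sketched in a few lines; the function name is illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (pick a random action index);
    with probability 1 - epsilon exploit the max-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With ε = 0 the selection is purely greedy; with ε = 1 it is purely random, and small ε values (e.g. 0.1) give the mostly-greedy behavior the text describes.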
And step nine, training the DQN neural network model and updating parameters.
(1) Back propagation of neural networks
Updating the neural network parameters involves back propagation: when the neural network is defined, each node is randomly assigned a weight and a bias. After one iteration, the error of the whole network can be calculated from the produced result; combined with the gradient of the cost function, the weights are then adjusted accordingly, so that the error decreases in the next iteration. This process of adjusting the weights using the gradient of the cost function is called back propagation. In back propagation, the signal propagates backward: the error propagates from the output layer through the hidden layers along the gradient of the cost function, accompanied by the adjustment of the weights.
The loss function of DQN (loss_function) is:

L(θ) = E[(TargetQ − Q(s, a; θ))^2]

where θ is the network parameter, and the target is:

TargetQ = r + γ max_a' Q(s', a'; θ⁻)

with θ⁻ denoting the delayed target-network parameters.
in the machine learning algorithm, firstly, a loss function of the model is determined according to a target value and a true value, then a gradient descent algorithm such as a quasi-newton method is selected to reduce the loss function step by step and update model parameters, namely after the loss function L (theta) is determined, the parameter theta is updated through gradient descent:
(2) construction of experience pool and parameter update between value neural network and real neural network
In addition, to minimize the influence that the correlation between consecutive training samples of the Q-estimate and Q-target networks has on the convergence of the loss function, DQN establishes two neural networks with the same structure (input and output sizes, network depth) but different parameters, and uses a delayed-update technique, fixed Q-targets, a mechanism for breaking the correlation of training samples: the parameters used by the Q-target network lag the parameters of the Q-estimate network by a certain number of steps, while the Q-estimate network's parameters are always the latest. Furthermore, an experience pool (experience replay) is introduced into the training: the data collected by the agent during simulation are stored in the pool, and once enough sample data has accumulated, a batch is drawn at random from these time-series samples, with the batch size determined by batch-size.
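The experience pool can be sketched as a bounded buffer with random minibatch sampling; capacity, names and the transition layout are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool: store (s, a, r, s') transitions and sample a
    random minibatch, breaking the correlation between consecutive
    training samples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old samples drop out

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # batch_size plays the role of batch-size in the text
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because `sample` draws uniformly from the whole pool, each minibatch mixes transitions from different points in the time series.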
And step ten, judging whether the training of the DQN neural network is finished.
The number of episodes is set, and during each training run it is judged whether one episode of the DQN neural network has finished according to the model exit condition, i.e., whether the number of dispatched bicycles meets the dispatching requirement or the number of dispatch operations has reached the upper limit. After the specified number of episodes has been completed and the loss function has decreased to a certain value, training of the DQN neural network is judged to be finished.
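The episode exit condition described here reduces to a simple predicate; a sketch with illustrative parameter names:

```python
def episode_done(dispatched, required, steps, step_limit):
    """Model exit condition: an episode ends once the number of
    dispatched bicycles meets the requirement, or the number of
    dispatch operations reaches its upper limit."""
    return dispatched >= required or steps >= step_limit
```

The outer training loop would check this predicate after every scheduling action, and the overall training stops after the configured number of episodes once the loss is low enough.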
And step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
After training of the reinforcement learning DQN neural network is completed, the initial time and initial state of the electronic fence group are input into the trained network; combined with the judgment condition, i.e., whether the bicycle dispatch number or the electronic fence scheduling count meets the requirement, a series of timed scheduling actions is obtained automatically, giving the whole electronic fence scheduling scheme. Assuming the scheduling actions (a, b, c, d) are respectively (+10, -5, -5), (+20, -10, -10), (+30, -15, -15) and (+40, -20, -20), then, given the rewards of the different actions from the trained DQN shown in the table below, the resulting schedule is a → d → b → b → None (end of scheduling).
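Extracting such a schedule is a greedy rollout of the trained Q-function; a hypothetical sketch in which `q_fn`, `step_fn` and `is_done` stand in for the trained network, the environment transition, and the exit condition:

```python
def extract_schedule(q_fn, step_fn, state, is_done, max_steps=50):
    """Greedy rollout of a trained Q-function: from the initial fence
    state, repeatedly take the argmax-Q action until the exit condition
    holds, yielding an action-index sequence (cf. a -> d -> b -> b)."""
    schedule = []
    for _ in range(max_steps):
        if is_done(state):
            break
        q_values = q_fn(state)
        best = max(range(len(q_values)), key=lambda i: q_values[i])
        schedule.append(best)
        state = step_fn(state, best)
    return schedule
```

With real components plugged in, the returned indices map onto the action set (a, b, c, d) to give the dispatch scheme.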
Claims (1)
1. A tide phenomenon control method of a shared bicycle system based on dynamic electronic fence and reinforcement learning comprises the following steps:
step one, determining the state information of the electronic fence group: the state information s = (s_1, s_2, …, s_j, t) mainly includes the electronic fence group status information s_j, i.e., the cumulative number of parked bicycles each electronic fence involved in the dispatch gains or loses over the whole dispatch period, and the current time t;
step two, determining the size scaling of the electronic fence as a scheduling action;
step three, determining the behavior and interaction of the intelligent agent;
step four, determining the current benefit obtainable by taking action a: (a1, a2, …, ai), i.e. enlarging or reducing the parking area of each electronic fence at a particular moment, where ai denotes the scheduling action of the i-th electronic fence;
step five, determining a reward function Q(st, a) to evaluate how good it is for an agent in a particular state st to take action a, where st is the current state information st = s: (s1, s2, …, sj, t);
step six, determining the agent state-transfer rule based on the DQN neural network, so that the state of the agent is updated automatically during reinforcement learning and continuously interacts with the environment, forming a closed loop;
step seven, constructing a DQN neural network and carrying out forward calculation, and the method is divided into the following substeps:
(1) determining the input information of the DQN neural network, the input information being the state information (s1, s2, …, sj, t), where t denotes the time, s1 denotes the cumulative number of dispatched bicycles of the target electronic fence at that moment, and s2, …, sj denote the respective cumulative numbers of dispatched bicycles of the other dispatched electronic fences;
(2) determining the output information of the DQN neural network, the output information being the action a: (a1, a2, …, ai), where ai denotes the scheduling action of the i-th electronic fence, there being n electronic fences in total;
(3) determining a DQN neural network structure;
step eight, selecting each output action a by utilizing a random exploration strategy;
step nine, training a DQN neural network model and updating parameters;
step ten, judging whether the training of the DQN neural network is finished;
and step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
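Steps seven to nine of the claim, the DQN forward calculation and the random-exploration (epsilon-greedy) action selection, can be sketched as follows. The layer sizes, action count, and epsilon value are illustrative assumptions, not values given in the patent, and the randomly initialised weights stand in for a trained network.

```python
import numpy as np

# Minimal sketch of the DQN forward pass (step seven) and epsilon-greedy
# action selection (step nine). All dimensions are hypothetical.
rng = np.random.default_rng(0)

N_FENCES = 4                      # j fences -> input state (s1..sj, t)
N_ACTIONS = 3                     # e.g. enlarge / keep / shrink a fence

# one hidden ReLU layer, randomly initialised for the sketch
W1 = rng.normal(0, 0.1, (N_FENCES + 1, 16))
b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_forward(state):
    """Forward calculation: state (s1..sj, t) -> one Q-value per action."""
    h = np.maximum(0.0, state @ W1 + b1)
    return h @ W2 + b2

def select_action(state, epsilon=0.1):
    """Random exploration strategy: with probability epsilon pick a random
    action, otherwise pick the action with the highest Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_forward(state)))

state = np.array([3.0, -1.0, 0.0, 2.0, 8.0])  # cumulative counts + time t
action = select_action(state)
```

During training (step nine onward) the weights would be updated from replayed transitions toward the target Q-values; this sketch shows only the inference path.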
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010172819.1A CN111461500B (en) | 2020-03-12 | 2020-03-12 | Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461500A CN111461500A (en) | 2020-07-28 |
CN111461500B true CN111461500B (en) | 2022-04-05 |
Family
ID=71684448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010172819.1A Active CN111461500B (en) | 2020-03-12 | 2020-03-12 | Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348258B (en) * | 2020-11-09 | 2022-09-20 | 合肥工业大学 | Shared bicycle predictive scheduling method based on deep Q network |
CN113095406B (en) * | 2021-04-14 | 2022-04-26 | 国能智慧科技发展(江苏)有限公司 | Electronic fence effective time period management and control method based on intelligent Internet of things |
CN114897656B (en) * | 2022-07-15 | 2022-11-25 | 深圳市城市交通规划设计研究中心股份有限公司 | Shared bicycle tidal area parking dredging method, electronic equipment and storage medium |
CN115879016B (en) * | 2023-02-20 | 2023-05-16 | 中南大学 | Prediction method for travel tide period of shared bicycle |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3872715A1 (en) * | 2015-11-12 | 2021-09-01 | Deepmind Technologies Limited | Asynchronous deep reinforcement learning |
CN105491124B (en) * | 2015-12-03 | 2018-11-02 | 北京航空航天大学 | Mobile vehicle distribution polymerization |
CN109447573A (en) * | 2018-10-09 | 2019-03-08 | 中国兵器装备集团上海电控研究所 | The specification parking management system and method for internet car rental |
TW202020473A (en) * | 2018-11-27 | 2020-06-01 | 奇異平台股份有限公司 | Electronic fence and electronic fence system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||