CN111461500B - Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning
- Publication number: CN111461500B (application CN202010172819.1A)
- Authority
- CN
- China
- Prior art keywords
- electronic fence
- determining
- neural network
- action
- dqn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0645—Rental transactions; Leasing transactions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Abstract
The invention provides a tidal phenomenon control method for a shared bicycle system based on a dynamic electronic fence and reinforcement learning, addressing the supply-demand imbalance of shared bicycle systems. The method comprises the following steps: (1) determining the state information of the electronic fence group; (2) determining the scheduling action of the electronic fences; (3) determining the behaviors and interactions of the agents; (4) determining the current benefit obtained by taking action a; (5) determining the reinforcement learning environment of the electronic fence scheduling system; (6) determining the agent state transition rule based on the DQN neural network; (7) constructing a DQN neural network and carrying out forward calculation; (8) selecting each output action using a random exploration strategy; (9) training the DQN neural network model and updating its parameters; (10) judging whether training of the DQN neural network is finished; (11) inputting the initial time and the initial state of the electronic fence group into the trained neural network to obtain the control strategy of the electronic fences.
Description
(I) technical field
The invention provides a tidal phenomenon control method for a shared bicycle system based on dynamic electronic fences and reinforcement learning, aimed at the supply-demand imbalance of urban shared bicycle systems. Under the premise of a limited number of bicycles and dynamic electronic fences, the electronic fences are scaled through a reinforcement learning method, inducing pedestrians to park bicycles at reasonable positions before the tide phenomenon arrives. With the goals of improving customer satisfaction and bicycle utilization, the method establishes a Deep Q Network (DQN) based reinforcement learning model and performs optimization control and decision-making on the tide phenomenon of shared bicycles, so as to relieve or eliminate the tide phenomenon in the shared bicycle system. The method belongs to the field of intelligent transportation.
(II) background of the invention
To relieve urban traffic problems, an important means widely accepted at home and abroad is to implement new green travel modes and construct a low-carbon, environmentally friendly traffic system. In China, the appearance of the "pile-free" (dockless) shared bicycle provides a brand-new solution to these problems and is an important innovation on urban public bicycles, but the development of shared bicycles currently has some problems to be solved urgently: (1) systems do not accurately evaluate bicycle demand during construction; blind, excessive deployment of bicycles causes waste, occupies excessive public resources, and increases the operating costs of enterprises; (2) during system construction, the asymmetric demand of the shared bicycle system is not fully considered, producing a tidal phenomenon: especially at peak times, riders in some areas cannot find a bicycle while bicycles elsewhere sit idle with no one to ride them.
For these problems, system constructors and supervisors have taken various measures, such as reducing the influence of the tide phenomenon through regular manual vehicle dispatching, or avoiding disordered parking by setting up electronic fences. However, the influence of these measures on the comprehensive benefits of the system, such as bicycle utilization rate and user satisfaction, has lacked effective systematic research. For this need, the invention is based on a dynamic electronic fence and a reinforcement learning method; with the goals of improving customer satisfaction and bicycle utilization, it constructs a shared bicycle system scheduling model that considers both indexes, gives a multi-objective system optimization algorithm, and provides a new solution for the supply-demand imbalance of shared bicycle systems and for improving their comprehensive benefit.
(III) Disclosure of the invention
(1) Objects of the invention
The invention provides a tide control scheme for an electronic fence based shared bicycle system with reinforcement learning at its core. By standardizing and inducing pedestrian parking positions, the positions of bicycles are automatically distributed before the flow peak arrives, solving the supply-demand imbalance caused by the tide phenomenon. By controlling the tide phenomenon in the shared bicycle system, the bicycle utilization rate and customer satisfaction can be improved with the same number of bicycles and electronic fences.
(2) Technical scheme
The invention relates to a tidal control method for a shared bicycle system based on dynamic electronic fences and reinforcement learning. The method first analyzes the attributes and parameter systems of the two agents, bicycle and pedestrian, and defines the bicycle scheduling evaluation indexes (bicycle utilization rate, pedestrian satisfaction, and their relation models). Then, by analyzing the agent types, interaction modes, and so on, an evaluation process for bicycle utilization rate and pedestrian satisfaction is determined, forming an agent-based simulation modeling method. Next, the invention determines the goals of electronic fence scheduling and the algorithmic environment, analyzes how DQN is applied to the bicycle tide control problem (including the details of the reinforcement learning algorithm and its complete process), and determines the overall process of electronic fence based bicycle tide control. Finally, the proposed reinforcement learning algorithm and control strategy are verified through simulation experiments: the bicycle utilization rate and pedestrian satisfaction before and after the control strategy is implemented are analyzed, and the feasibility and effectiveness of the method are evaluated and verified.
The method comprises the following steps:
step one, determining the state information of the electronic fence group.
And step two, determining the size scaling of the electronic fence as a scheduling action.
And step three, determining the behaviors and the interaction of the intelligent agent.
Step four, determining the current benefit obtained by taking action a.
Step five, determining a reward function Q(s_t, a) to evaluate how good it is for the agent to take action a in a particular state s_t.
And step six, determining an intelligent agent state transfer rule based on the DQN neural network, thereby automatically updating the state of the intelligent agent in the reinforcement learning process and continuously interacting with the intelligent agent environment to form a closed loop.
And seventhly, constructing a DQN neural network and performing forward calculation. The method is divided into the following substeps:
(1) determining input information for DQN neural networks
(2) Determining output information for DQN neural networks
(3) Determining DQN neural network structure
And step eight, selecting each output action by utilizing a random exploration strategy.
And step nine, training the DQN neural network model and updating parameters.
And step ten, judging whether the training of the DQN neural network is finished.
And step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
Through the steps, the optimal control strategy of the shared bicycle system before the tide phenomenon comes can be obtained.
(IV) description of the drawings
FIG. 1 is an overall architecture of the present invention
FIG. 2 is a flow of agent behavior interaction between a pedestrian, an electronic fence, and a bicycle
FIG. 3 is a satisfaction degree calculation flow
FIG. 4 is a diagram of a neural network architecture
(V) detailed description of the preferred embodiments
The invention provides a tidal control method for a shared bicycle system based on reinforcement learning. Deep Q Network (DQN), as a typical reinforcement learning algorithm, avoids having to establish a complex and precise mathematical model when solving this intelligent optimization problem. Before the tide phenomenon of shared bicycles occurs, the sizes of the electronic fences in the system are effectively scheduled so that bicycles are induced toward fence areas with high demand, improving shared bicycle utilization and pedestrian satisfaction. To make the technical solution, features and advantages of the invention more clearly understood, the following detailed description is made with reference to the accompanying drawings. The overall architecture of the invention is shown in fig. 1, and the specific implementation steps are as follows:
step one, determining the state information of the electronic fence group.
State s = (s_1, s_2, …, s_i, t): the state mainly includes the electronic fence group state information s_i, i.e., the cumulative number of parked bicycles that each electronic fence involved in the dispatch gains (or loses) over the whole dispatch period. In addition, since the instant reward generated by executing an action at a state transition is time-dependent, the state s should also include the current time t. Here the state s of the system is selected as the combination of these two parts, and the state s can decide whether the reinforcement learning is finished. The cluster controller obtains the state information at each discrete time point as the decision basis; the related model calculation process and states can be obtained through simulation software.
And step two, determining the size scaling of the electronic fence as a scheduling action.
Action a = (a_1, a_2, …, a_i): the scaling of the electronic fences is taken as the scheduling action, i.e., enlarging (or reducing) the parking range of each electronic fence at a specific time. Here a_1 + a_2 + a_3 + … + a_i = 0: since a_1 is the number of bicycles induced to a given fence per unit time period, these bicycles must come from the other fences, so a_1 = -(a_2 + a_3 + … + a_i). This action formulation is easy to understand and convenient to compute. The available action set A depends on the current state of the system.
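The zero-sum constraint on actions described above can be sketched in code. This is a minimal illustrative sketch; the helper name and the even split across the other fences are assumptions, not part of the patent:

```python
def make_action(induced_to_target, n_other_fences):
    """Build a zero-sum scheduling action: the target fence gains
    induced_to_target bicycles, drawn evenly from the other fences,
    so that a1 + a2 + ... + ai = 0, i.e. a1 = -(a2 + ... + ai)."""
    per_fence = induced_to_target / n_other_fences
    return [induced_to_target] + [-per_fence] * n_other_fences

# Example action (+10, -5, -5): one fence gains 10 bicycles,
# two other fences each contribute 5.
action = make_action(10, 2)
```

Any other split among the contributing fences would satisfy the same constraint; the even split is just the simplest choice.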
And step three, determining the behaviors and the interaction of the intelligent agent.
Agent interaction rules are the rules that agent movement must follow; the interaction of pedestrians, bicycles and the environment is mainly reflected in the pedestrian's travel mode. In the shared bicycle system, pedestrian travel falls into two categories. The first is unrelated to the dispatching process: if a pedestrian does not accept the distance to the nearest bicycle, walking is chosen. The second relates to dispatching: a pedestrian who accepts the distance to a bicycle may still be affected by scheduling, depending on whether the current time is within the scheduling period, whether the pedestrian's destination is in the set of scheduled electronic fences, whether the target electronic fence's demand has already been met, and whether the pedestrian accepts the scheduling and continues to choose to ride. The sizes of the electronic fences are scheduled in a period before the tide phenomenon occurs; the behavior interaction flow of pedestrians, electronic fences and bicycles is shown in fig. 2.
Step four, determining the current benefit obtained by taking action a.
The instant reward r(s_t, a) is the current benefit obtained by taking action a in state s_t; these values together constitute an r(s_t, a) matrix. The dispatching aim is that, in a specific time period before the tide phenomenon arrives, the number of bicycles in certain areas reaches the required quantity, so that the bicycle riding rate, i.e., the average pedestrian satisfaction, becomes as high as possible. The average pedestrian satisfaction per unit time period during dispatching can be calculated through simulation and used as r(s_t, a). Calculating the average pedestrian satisfaction mainly requires counting the average daily number of rides: whenever a pedestrian agent's state is "using a bicycle", the ride counter is incremented (num_cycling++), so num_cycling represents the total number of rides. Fig. 3 shows the situations in which a pedestrian may ride a bicycle; there, the calculation of average pedestrian satisfaction mainly requires counting the number of times bicycles are used. According to the above analysis, riding first requires determining whether a pedestrian with a riding demand can reach a bicycle, and the four possibilities are associated with one another.
Step five, determining a reward function Q(s_t, a) to evaluate how good it is for the agent to take action a in a particular state s_t.
The reward function Q(s_t, a) evaluates how good it is for the agent to take action a in state s_t, i.e., it is the action-utility function. Q(s_t, a) is the expected value of the discounted sum of instant rewards r(s_t, a) over a series of actions, i.e., Q(s_t, a) = E[Σ_i γ^i r_i(s_t, a)]. Solving for Q(s_t, a) with the reinforcement learning algorithm yields a scheduling scheme, since Q(s_t, a) instructs the agent to take the most favourable action, so that the average pedestrian satisfaction and average bicycle utilization rate finally reach their maximum values.
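The discounted sum inside this expectation can be computed directly; a minimal sketch, where the default γ = 0.9 is an assumed example value:

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum over i of gamma^i * r_i,
    the quantity whose expectation defines Q(s_t, a)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```

For example, with γ = 0.5 the reward sequence [1, 1, 1] yields 1 + 0.5 + 0.25 = 1.75.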
Assume three electronic fences A, B and C participate in scheduling and the initial state is (0, 0, 0, 0). Through the action (+10, -5, -5), i.e., A induces 10 bicycles while B and C each give up 5, the state of the electronic fence group becomes (+10, -5, -5, 1). The average pedestrian satisfaction in this stage is taken as the reward obtained by the agent's action, e.g., r = 0.356; the instant rewards corresponding to the actions taken by the electronic fences in different states can be simulated through AnyLogic.
And step six, determining an intelligent agent state transfer rule based on the DQN neural network, thereby automatically updating the state of the intelligent agent in the reinforcement learning process and continuously interacting with the intelligent agent environment to form a closed loop.
The state transition rule of the agent is determined in the DQN neural network, so that the agent's state can be automatically updated in the reinforcement learning process and the agent continuously interacts with its environment, forming a closed loop. Assume the agent state is (in_num, out_num_1, out_num_2, t), where in_num is the cumulative number of bicycles induced from the target fences into electronic fence A near the teaching building by time t, and out_num_1 and out_num_2 are the cumulative numbers of bicycles that, from the start of scheduling to time t, should have been parked within the acceptance range of stadium electronic fences B and C respectively, with A = B + C. There are four scheduling actions, the set (a, b, c, d), represented respectively as (+10, -5, -5), (+20, -10, -10), (+30, -15, -15) and (+40, -20, -20); the unit in each tuple is bicycles.
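A hypothetical sketch of this state-transition rule; the function name and tuple layout are assumptions for illustration:

```python
def step(state, action):
    """Apply one scheduling action (da, db, dc) to the agent state
    (in_num, out_num1, out_num2, t) and advance time by one step."""
    in_num, out1, out2, t = state
    da, db, dc = action                      # e.g. (+10, -5, -5)
    new_state = (in_num + da, out1 - db, out2 - dc, t + 1)
    # invariant from the text: A = B + C
    assert new_state[0] == new_state[1] + new_state[2]
    return new_state
```

Starting from (0, 0, 0, 0), the action (+10, -5, -5) produces the state (10, 5, 5, 1): fence A has gained 10 bicycles, 5 drawn from each of B and C.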
And seventhly, constructing a DQN neural network and performing forward calculation.
A typical neuron consists of five parts: inputs, weights and biases (thresholds), a summation unit, an excitation (activation) function, and an output. The neural network structure used to store the value function is shown in FIG. 4.
(1) Determining input information for DQN neural networks
The input layer is (s_1, s_2, …, s_j, t), where t denotes the current time, s_1 represents the cumulative number of dispatched bicycles of the target electronic fence at that moment, and s_2, …, s_j represent the cumulative numbers of dispatched bicycles of each of the other scheduled electronic fences.
(2) Determining output information for DQN neural networks
The output layer is (a_1, a_2, …, a_n); its dimension n represents a total of n scheduling actions, where a_i represents the ith scheduling action, e.g., (+10, -5, -5).
(3) Determining structure of DQN neural network
The neural network has two hidden layers, i.e., its depth is 2. The dimension of the input layer is j + 1 and that of the output layer is n; after the output layer selects the action corresponding to a Q value according to the ε-greedy policy, the action a_k at time t + 1 is determined. A neural network is a complex network formed by a large number of simple neurons connected to one another; the summation unit performs a weighted sum of the input signals, and the result is passed through the excitation function to give the neuron's output. The output of the jth neuron is:
a_j^l = σ(Σ_k w_jk^l · a_k^(l-1) + b_j^l)

where w_jk^l denotes the weight connecting the kth neuron of layer l-1 to the jth neuron of layer l (the input layer is layer 0; here l = 2), a_k^(l-1) is the output of the kth neuron of the previous layer, and b_j^l is the bias of the jth neuron of layer l. σ is the excitation function: since a purely linear neural network lacks expressive power, a nonlinear factor is added through the excitation function to handle nonlinear problems. Considering differentiability and monotonicity, typical forms of the excitation function are tanh, sigmoid and ReLU; ReLU converges quickly, is simple to compute, and is hard to saturate, so here ReLU is used instead of sigmoid:

f(x) = max(0, x)

The weights w_jk^l and biases b_j^l are tunable; they reflect the behavioral characteristics of the neural network.
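The per-neuron forward computation described above can be written out directly; a minimal sketch in plain Python:

```python
def relu(x):
    """Excitation function f(x) = max(0, x)."""
    return max(0.0, x)

def neuron_output(weights, inputs, bias):
    """Weighted sum of the previous layer's outputs plus the bias,
    passed through the ReLU excitation function."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return relu(z)
```

A full layer is just this computation repeated for each neuron, with its own weight vector and bias.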
And step eight, selecting each output action by utilizing a random exploration strategy.
In the dispatching process of the electronic fence bicycle system, randomness is everywhere: the origin and destination of a riding pedestrian are uncertain, as is whether a pedestrian will ride at all. Therefore an ε-greedy action selection strategy is used in the DQN neural network; this random exploration strategy mitigates the problem that value-function-based algorithms such as DQN sometimes cannot obtain the optimal policy. An ε-greedy action includes the time and the size of each electronic fence at that time. A small ε value is set to prevent the algorithm from falling into a locally optimal solution, so that the agent keeps a certain amount of exploration while searching for the globally optimal solution. After a list of Q values is obtained from the neural network, the action to take is selected according to the ε-greedy strategy: with probability 1 - ε, the next action of the electronic fence system is determined by the maximum Q value output by the value network (one Q value corresponds to one action); with probability ε, the system explores, i.e., an action is selected at random. Actions continue to be selected in this way until, after an action is taken, the next state is reached.
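The ε-greedy rule just described can be sketched in a few lines; the function name is illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (pick a random action index);
    with probability 1 - epsilon exploit the max-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With ε = 0 the selection is purely greedy; with ε = 1 it is purely random, and small ε values (e.g. 0.1) give the mostly-greedy behavior the text describes.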
And step nine, training the DQN neural network model and updating parameters.
(1) Back propagation of neural networks
Updating the neural network parameters involves back propagation: when the neural network is defined, each node is randomly assigned a weight and a bias. After one iteration, the error of the whole network can be calculated from the produced result; combined with the gradient of the cost function, the weights are then adjusted accordingly, so that the error decreases in the next iteration. This process of adjusting the weights using the gradient of the cost function is called back propagation. In back propagation, the signal propagates backward: the error propagates from the output layer through the hidden layers along the gradient of the cost function, accompanied by the adjustment of the weights.
The loss function of DQN (loss_function) is:

L(θ) = E[(TargetQ − Q(s, a; θ))^2]

where θ is the network parameter, and the target is:

TargetQ = r + γ max_a' Q(s', a'; θ⁻)

with θ⁻ denoting the delayed target-network parameters.
in the machine learning algorithm, firstly, a loss function of the model is determined according to a target value and a true value, then a gradient descent algorithm such as a quasi-newton method is selected to reduce the loss function step by step and update model parameters, namely after the loss function L (theta) is determined, the parameter theta is updated through gradient descent:
(2) construction of experience pool and parameter update between value neural network and real neural network
In addition, to minimize the influence that the correlation between consecutive training samples of the Q-estimate and Q-target networks has on the convergence of the loss function, DQN establishes two neural networks with the same structure (input and output sizes, network depth) but different parameters, and uses a delayed-update technique, fixed Q-targets, a mechanism for breaking the correlation of training samples: the parameters used by the Q-target network lag the parameters of the Q-estimate network by a certain number of steps, while the Q-estimate network's parameters are always the latest. Furthermore, an experience pool (experience replay) is introduced into the training: the data collected by the agent during simulation are stored in the pool, and once enough sample data has accumulated, a batch is drawn at random from these time-series samples, with the batch size determined by batch-size.
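The experience pool can be sketched as a bounded buffer with random minibatch sampling; capacity, names and the transition layout are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool: store (s, a, r, s') transitions and sample a
    random minibatch, breaking the correlation between consecutive
    training samples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old samples drop out

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # batch_size plays the role of batch-size in the text
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because `sample` draws uniformly from the whole pool, each minibatch mixes transitions from different points in the time series.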
And step ten, judging whether the training of the DQN neural network is finished.
The number of episodes is set, and during each training run it is judged whether one episode of the DQN neural network has finished according to the model exit condition, i.e., whether the number of dispatched bicycles meets the dispatching requirement or the number of dispatch operations has reached the upper limit. After the specified number of episodes has been completed and the loss function has decreased to a certain value, training of the DQN neural network is judged to be finished.
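The episode exit condition described here reduces to a simple predicate; a sketch with illustrative parameter names:

```python
def episode_done(dispatched, required, steps, step_limit):
    """Model exit condition: an episode ends once the number of
    dispatched bicycles meets the requirement, or the number of
    dispatch operations reaches its upper limit."""
    return dispatched >= required or steps >= step_limit
```

The outer training loop would check this predicate after every scheduling action, and the overall training stops after the configured number of episodes once the loss is low enough.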
And step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
After training of the reinforcement learning DQN neural network is completed, the initial time and initial state of the electronic fence group are input into the trained network; combined with the judgment condition, i.e., whether the bicycle dispatch number or the electronic fence scheduling count meets the requirement, a series of timed scheduling actions is obtained automatically, giving the whole electronic fence scheduling scheme. Assuming the scheduling actions (a, b, c, d) are respectively (+10, -5, -5), (+20, -10, -10), (+30, -15, -15) and (+40, -20, -20), then, given the rewards of the different actions from the trained DQN shown in the table below, the resulting schedule is a → d → b → b → None (end of scheduling).
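Extracting such a schedule is a greedy rollout of the trained Q-function; a hypothetical sketch in which `q_fn`, `step_fn` and `is_done` stand in for the trained network, the environment transition, and the exit condition:

```python
def extract_schedule(q_fn, step_fn, state, is_done, max_steps=50):
    """Greedy rollout of a trained Q-function: from the initial fence
    state, repeatedly take the argmax-Q action until the exit condition
    holds, yielding an action-index sequence (cf. a -> d -> b -> b)."""
    schedule = []
    for _ in range(max_steps):
        if is_done(state):
            break
        q_values = q_fn(state)
        best = max(range(len(q_values)), key=lambda i: q_values[i])
        schedule.append(best)
        state = step_fn(state, best)
    return schedule
```

With real components plugged in, the returned indices map onto the action set (a, b, c, d) to give the dispatch scheme.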
Claims (1)
1. A tide phenomenon control method of a shared bicycle system based on dynamic electronic fence and reinforcement learning comprises the following steps:
step one, determining the state information of the electronic fence group: the state information s = (s_1, s_2, …, s_j, t) mainly includes the electronic fence group status information s_j, i.e., the cumulative number of parked bicycles each electronic fence involved in the dispatch gains or loses over the whole dispatch period, and the current time t;
step two, determining the size scaling of the electronic fence as a scheduling action;
step three, determining the behavior and interaction of the intelligent agent;
step four, determining the current benefit obtainable by taking action a: (a1, a2, …, ai), i.e. enlarging or reducing the parking area of each electronic fence at a particular moment, where ai denotes the scheduling action of the i-th electronic fence;
step five, determining a reward function Q(st, a) to evaluate how good it is for an agent in a particular state st to take action a, where st is the current state information st = s: (s1, s2, …, sj, t);
step six, determining the agent state-transfer rule based on the DQN neural network, so that the state of the agent is updated automatically during reinforcement learning and continuously interacts with the environment, forming a closed loop;
step seven, constructing a DQN neural network and carrying out forward calculation, and the method is divided into the following substeps:
(1) determining the input information of the DQN neural network, the input information being the state information (s1, s2, …, sj, t), where t denotes the time, s1 denotes the cumulative number of dispatched bicycles of the target electronic fence at that moment, and s2, …, sj denote the respective cumulative numbers of dispatched bicycles of the other dispatched electronic fences;
(2) determining the output information of the DQN neural network, the output information being the action a: (a1, a2, …, ai), where ai denotes the scheduling action of the i-th electronic fence, there being n electronic fences in total;
(3) determining a DQN neural network structure;
step eight, selecting each output action a by utilizing a random exploration strategy;
step nine, training a DQN neural network model and updating parameters;
step ten, judging whether the training of the DQN neural network is finished;
and step eleven, inputting the initial time and the initial state of the electronic fence group into the trained neural network, and acquiring the electronic fence control strategy in the shared bicycle system.
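Steps seven to nine of the claim, the DQN forward calculation and the random-exploration (epsilon-greedy) action selection, can be sketched as follows. The layer sizes, action count, and epsilon value are illustrative assumptions, not values given in the patent, and the randomly initialised weights stand in for a trained network.

```python
import numpy as np

# Minimal sketch of the DQN forward pass (step seven) and epsilon-greedy
# action selection (step nine). All dimensions are hypothetical.
rng = np.random.default_rng(0)

N_FENCES = 4                      # j fences -> input state (s1..sj, t)
N_ACTIONS = 3                     # e.g. enlarge / keep / shrink a fence

# one hidden ReLU layer, randomly initialised for the sketch
W1 = rng.normal(0, 0.1, (N_FENCES + 1, 16))
b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_forward(state):
    """Forward calculation: state (s1..sj, t) -> one Q-value per action."""
    h = np.maximum(0.0, state @ W1 + b1)
    return h @ W2 + b2

def select_action(state, epsilon=0.1):
    """Random exploration strategy: with probability epsilon pick a random
    action, otherwise pick the action with the highest Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_forward(state)))

state = np.array([3.0, -1.0, 0.0, 2.0, 8.0])  # cumulative counts + time t
action = select_action(state)
```

During training (step nine onward) the weights would be updated from replayed transitions toward the target Q-values; this sketch shows only the inference path.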
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010172819.1A CN111461500B (en) | 2020-03-12 | 2020-03-12 | Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461500A CN111461500A (en) | 2020-07-28 |
CN111461500B true CN111461500B (en) | 2022-04-05 |
Family
ID=71684448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010172819.1A Active CN111461500B (en) | 2020-03-12 | 2020-03-12 | Shared bicycle system tide phenomenon control method based on dynamic electronic fence and reinforcement learning |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348258B (en) * | 2020-11-09 | 2022-09-20 | 合肥工业大学 | Shared bicycle predictive scheduling method based on deep Q network |
CN113095406B (en) * | 2021-04-14 | 2022-04-26 | 国能智慧科技发展(江苏)有限公司 | Electronic fence effective time period management and control method based on intelligent Internet of things |
CN114897656B (en) * | 2022-07-15 | 2022-11-25 | 深圳市城市交通规划设计研究中心股份有限公司 | Shared bicycle tidal area parking dredging method, electronic equipment and storage medium |
CN115879016B (en) * | 2023-02-20 | 2023-05-16 | 中南大学 | Prediction method for travel tide period of shared bicycle |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3872715A1 (en) * | 2015-11-12 | 2021-09-01 | Deepmind Technologies Limited | Asynchronous deep reinforcement learning |
CN105491124B (en) * | 2015-12-03 | 2018-11-02 | 北京航空航天大学 | Mobile vehicle distribution polymerization |
CN109447573A (en) * | 2018-10-09 | 2019-03-08 | 中国兵器装备集团上海电控研究所 | The specification parking management system and method for internet car rental |
TW202020473A (en) * | 2018-11-27 | 2020-06-01 | 奇異平台股份有限公司 | Electronic fence and electronic fence system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||