Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a shared bicycle prediction and scheduling method based on a deep Q network, so that the vehicle demand of each vehicle-using area in a future time period can be predicted in advance, the scheduling of shared bicycles becomes more reasonable and efficient, and the advantages of sharing are fully exploited.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention relates to a shared bicycle prediction scheduling method based on a deep Q network, which is characterized by comprising the following steps:
step 1: establishing a vehicle area model and an unmanned dispatching transport vehicle operation environment model;
step 2: collecting days of all users in the vehicle area kUsual behavior data and the number h of single vehicles in the vehicle area k
together with the historical vehicle-using times t_k of the shared vehicle-using area k and the corresponding numbers of vehicles used; the j-th daily behavior record generated by user i in the vehicle-using area k is denoted d_ij and comprises a one-hot encoded vector converted from the weather information of the j-th record generated by user i, the time information of the j-th record generated by user i, the riding road information of the j-th record generated by user i (including a starting point, an end point and a route selection), and the vehicle-using information of the j-th record generated by user i; the user behavior data matrix is thereby obtained as D = (d_ij)_{M×N};
step 3: training a prediction network model consisting of a linear regression model and an SVM model:
step 3.1: constructing a linear regression model, taking the historical vehicle-using times t_k of the vehicle-using area k and the corresponding numbers of vehicles used as input variables, and optimizing the hyper-parameters of the linear regression model until it converges, thereby obtaining a prediction model that predicts the number of vehicles n'_k needed in the vehicle-using area k in the future time period;
step 3.2: constructing an SVM model and training it with the user behavior data matrix D as the input variable, so as to obtain a trained SVM model that outputs a classification result representing the vehicle-using demand of user i; if the classification result is 1, user i needs to use a vehicle, otherwise user i does not, and the predicted vehicle usage n''_k of the vehicle-using area k in the future time period is calculated from the classification results;
step 3.3: obtaining the prediction result of the prediction network model, namely the predicted vehicle usage n_k of the vehicle-using area k in the future time period, by a weighted combination of the regression prediction n'_k and the SVM prediction n''_k for the future time period;
step 4: repeating step 2 and step 3 until the estimated vehicle usage of every vehicle-using area has been calculated;
step 5: defining the action instruction set a = {a_1, …, a_t, …, a_m}, where a_t represents the action information of the unmanned dispatching transport vehicle at time t and a_t = {η_t, κ_t}, η_t denoting the driving-direction information of the vehicle at time t and κ_t denoting whether the dispatching transport vehicle picks up or puts down a single vehicle at time t; defining the state instruction set s = {s_0, …, s_t, …, s_m}, where s_t represents the operating-environment state information of the unmanned dispatching transport vehicle at time t and s_t = {ρ_t, ι_t, μ_t}, ρ_t denoting the number of vehicles in each vehicle-using area at time t, i.e. ρ_t = (n_t1, …, n_tk, …, n_tX), where n_tk is the predicted vehicle usage of area k at time t and X is the total number of vehicle-using areas, ι_t denoting the position information of the dispatching transport vehicle at time t, and μ_t denoting the position information of each user at time t;
step 6: setting the reward function R with equation (1):
R = R_pr + R_a + R_n (1)
In equation (1), R_pr is the reward function of the prediction network model, given by equation (2), in which ζ denotes a reward-penalty coefficient and ζ ∈ (0, 1); R_a is the fixed action reward of the unmanned dispatching transport vehicle, given by equation (3), in which e is a constant; R_n is the single-vehicle dispatching reward function, given by equation (4), in which Δ indicates whether the unmanned dispatching transport vehicle is parked in the designated vehicle parking area (Δ = 1 means parked in the designated area, Δ = 0 means not parked in the designated area), h_k denotes the number of existing vehicles in the vehicle-using area k, c_k denotes the number of single vehicles put down or taken away by the unmanned dispatching transport vehicle in the vehicle-using area k, b is a constant, and r denotes another reward-penalty coefficient with r ∈ (0, 1);
step 7: setting the learning rate α, the reward attenuation coefficient γ and the update frequency T, and initializing t = 1;
step 8: constructing a prediction scheduling model based on a deep Q network:
step 8.1: constructing a prediction evaluation network model comprising an input layer, a hidden layer with m_1 layers, an FC layer and an output layer, and initializing the network parameters of the prediction evaluation network model to θ_0 by Gaussian initialization;
step 8.2: constructing a prediction target network model with the same structure as the prediction evaluation network model, and initializing the network parameters of the prediction target network model to θ* by Gaussian initialization;
step 9: optimizing the parameters of the prediction evaluation network model:
step 9.1: calculating the auxiliary adjustment coefficient σ of the prediction evaluation network model with equation (5):
In equation (5), β denotes an error adjustment coefficient, and max_k(h_k + c_k − n_k) denotes the maximum error of the predicted vehicle usage over all vehicle-using areas;
step 9.2: the cost function Ψ is calculated using equation (6):
In equation (6), m denotes the total number of states in the operating-environment model of the unmanned dispatching transport vehicle, Q(s_t, a_t) denotes the true cumulative return at time t, Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t, and θ_t denotes the network parameters at time t;
step 9.3: updating the network parameters of the prediction evaluation network model by using the formula (7):
In equation (7), θ_t denotes the network parameters of the prediction evaluation network model at time t, θ_t* denotes the network parameters of the prediction target network model at time t, R_t denotes the value of the reward function at time t, Q(s_{t+1}, a_{t+1}; θ_t*) denotes the estimate of the true cumulative return given by the prediction target network model at time t, and Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t;
step 10: according to the update frequency T, updating the network parameters θ* of the prediction target network model at each update moment with the parameters θ of the prediction evaluation network model at the corresponding moment;
step 11: assigning t + 1 to t and judging whether t > A holds, where A is a set threshold; if so, the optimal prediction target network model has been obtained; otherwise, returning to step 9.2 and continuing in sequence;
step 12: and utilizing the optimal prediction target network model to realize real-time scheduling of the number of the single vehicles in each vehicle area.
Compared with the prior art, the invention has the beneficial effects that:
1. the shared bicycle prediction scheduling method based on the deep Q network overcomes the lag of traditional scheduling algorithms by combining prediction with scheduling, thereby greatly improving the utilization rate of shared bicycles;
2. the prediction network model can sense the vehicle using requirements of the user in advance by combining the advantages of the linear regression model and the SVM model, so that the vehicle can be scheduled in place before the user actually uses the vehicle, and the waiting time of the user is reduced;
3. by combining reinforcement learning with the prediction model, the hyper-parameters can be optimized through experience-replay learning even when training data are insufficient, which greatly reduces the training cost, improves the efficiency of the model, greatly improves the scheduling efficiency and timeliness of shared bicycles, and reduces the scheduling cost of shared bicycles.
Detailed Description
In this embodiment, as shown in fig. 3, a shared bicycle prediction scheduling method based on a deep Q network predicts the vehicle demand of each vehicle area in a future time period in advance by combining a linear regression model, an SVM model and a deep reinforcement learning method of the deep Q network under the condition of lack of sufficient training data, and specifically includes the following steps:
step 1: configuring a simulation environment with the Tkinter tool in the Python GUI library, and establishing a vehicle-using area model and an unmanned dispatching transport vehicle operating-environment model. The vehicle-using area model is composed as follows: the urban environment is simulated with a 5 × 5 grid; A-F respectively represent six different types of vehicle-using areas, namely a school, a park, a stadium, a pedestrian street, an office building and a subway station; a parking area and a parking upper limit are defined in each vehicle-using area, the parking areas being shown in grey and distinguished by numbers, so as to simulate the differences in vehicle demand between different types of areas; the total number of single vehicles is assumed to be 100, of which 20, 10, 20 and 20 are respectively distributed to the six areas at initialization, and the maximum vehicle capacities of the areas are 30, 20, 15, 25, 30 and 50. The unmanned dispatching transport vehicle operating-environment model comprises: the position of the unmanned dispatching transport vehicle, represented by a short solid black line; the actions of the unmanned dispatching transport vehicle, comprising driving up, driving down, driving left, driving right, putting down a single vehicle and taking away a single vehicle, with 10 single vehicles allocated to the unmanned dispatching transport vehicle at initialization; dotted lines simulating urban roads; blank areas representing other areas through which the unmanned dispatching transport vehicle is forbidden to pass; and the solid black lines on the periphery representing the boundary, which the dispatching vehicle cannot cross, as shown in figure 1;
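As a minimal sketch of the grid operating-environment model described in step 1 (written in plain Python rather than Tkinter, with all class and attribute names chosen for illustration only):

from dataclasses import dataclass, field

@dataclass
class VehicleArea:
    name: str       # one of the six area types A-F (school, park, ...)
    capacity: int   # maximum number of single vehicles the area may hold
    bikes: int = 0  # single vehicles currently parked in the area

@dataclass
class GridEnvironment:
    size: int = 5                              # 5 x 5 grid of cells
    areas: dict = field(default_factory=dict)  # (row, col) -> VehicleArea
    truck_pos: tuple = (0, 0)                  # unmanned dispatching transport vehicle position
    truck_load: int = 10                       # single vehicles carried by the truck at start

    def in_bounds(self, pos):
        r, c = pos
        return 0 <= r < self.size and 0 <= c < self.size

    def move_truck(self, direction):
        # drive up/down/left/right without crossing the boundary
        dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[direction]
        new_pos = (self.truck_pos[0] + dr, self.truck_pos[1] + dc)
        if self.in_bounds(new_pos):
            self.truck_pos = new_pos

    def drop_bike(self):
        # put one single vehicle down if the current cell is an area that is not full
        area = self.areas.get(self.truck_pos)
        if area and self.truck_load > 0 and area.bikes < area.capacity:
            area.bikes += 1
            self.truck_load -= 1

    def pick_bike(self):
        # take one single vehicle away from the area the truck currently stands on
        area = self.areas.get(self.truck_pos)
        if area and area.bikes > 0:
            area.bikes -= 1
            self.truck_load += 1

env = GridEnvironment(areas={(0, 0): VehicleArea("school", capacity=30, bikes=20)})
env.move_truck("right")
env.pick_bike()   # no area at (0, 1), so nothing happens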
step 2: collecting the daily behavior data of all users in the vehicle-using area k and the number h_k of single vehicles in the vehicle-using area k, together with the historical vehicle-using times t_k of the shared vehicle-using area k and the corresponding numbers of vehicles used; the j-th daily behavior record generated by user i in the vehicle-using area k is denoted d_ij and comprises a one-hot encoded vector converted from the weather information of the j-th record generated by user i, the time information of the j-th record generated by user i, the riding road information of the j-th record generated by user i (including a starting point, an end point and a route selection), and the vehicle-using information of the j-th record generated by user i; the user behavior data matrix is thereby obtained as D = (d_ij)_{M×N};
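A minimal sketch of how one daily behavior record d_ij could be turned into a numeric row of the user behavior data matrix D; the raw-record fields and the weather vocabulary are assumptions, since the text above only states that the weather information is one-hot encoded and that each record carries time, riding-road (start point, end point, route) and vehicle-using information:

import numpy as np

WEATHER_STATES = ["sunny", "cloudy", "rainy", "snowy"]   # assumed weather vocabulary

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_record(record):
    # turn the j-th raw record of user i into a numeric feature vector d_ij
    weather = one_hot(record["weather"], WEATHER_STATES)
    time_info = np.array([record["hour"]], dtype=float)                       # assumed time encoding
    road_info = np.array([record["start"], record["end"], record["route"]], dtype=float)
    used_bike = np.array([1.0 if record["used_bike"] else 0.0])               # vehicle-using information
    return np.concatenate([weather, time_info, road_info, used_bike])

# one row per (user i, record j) pair; M users with N records each give the M x N layout
records = [
    {"weather": "sunny", "hour": 8, "start": 3, "end": 12, "route": 1, "used_bike": True},
    {"weather": "rainy", "hour": 18, "start": 12, "end": 3, "route": 2, "used_bike": False},
]
D = np.stack([encode_record(r) for r in records])
print(D.shape)   # (2, 9)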
step 3: training a prediction network model consisting of a linear regression model and an SVM model:
step 3.1: constructing a linear regression model f(t) = at² + bt + c, where a, b and c are the three hyper-parameters to be adjusted during training; the historical vehicle-using times t_k of the vehicle-using area k and the corresponding numbers of vehicles used f(t_k) are taken as input variables and the hyper-parameters of the linear regression model are optimized until the model converges, thereby obtaining a prediction model that predicts the number of vehicles n'_k needed in the vehicle-using area k in the future time period;
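A minimal sketch of step 3.1, fitting f(t) = at² + bt + c by least squares to the historical vehicle-using times t_k of area k and the corresponding usage counts f(t_k); the example data and the future hour t = 20 are illustrative only:

import numpy as np

t_k = np.array([7, 8, 9, 12, 17, 18, 19], dtype=float)      # historical vehicle-using hours of area k
f_tk = np.array([12, 30, 22, 10, 25, 35, 20], dtype=float)   # vehicles used at those hours

a, b, c = np.polyfit(t_k, f_tk, deg=2)   # least-squares fit of the quadratic f(t) = a*t^2 + b*t + c

def predict(t):
    return a * t ** 2 + b * t + c

n_prime_k = max(0.0, predict(20))   # regression prediction n'_k for an assumed future hour t = 20
print(round(n_prime_k, 1))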
step 3.2: constructing an SVM model δ_i = sign(ω*·d_ij + d*), where ω* and d* are the hyper-parameters to be adjusted; the SVM model is trained with the user behavior data matrix D as the input variable to obtain a trained SVM model that outputs a classification result representing the vehicle-using demand of user i; if the classification result is 1, user i needs to use a vehicle, otherwise user i does not, and the predicted vehicle usage n''_k of the vehicle-using area k in the future time period is calculated from the classification results;
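A minimal sketch of step 3.2, using scikit-learn's linear SVM in place of the formula δ_i = sign(ω*·d_ij + d*); the feature matrix and labels are synthetic and the record dimensionality is assumed:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 9))                      # user behavior data matrix (one row per record)
labels = (D[:, 0] + D[:, 3] > 0).astype(int)       # 1 = the user needed a vehicle (toy labels)

svm = LinearSVC()                                  # learns the separating hyperplane (ω*, d*)
svm.fit(D, labels)

future_records = rng.normal(size=(30, 9))          # records attributed to area k for the future period
delta = svm.predict(future_records)                # δ_i = 1 means user i needs a vehicle
n_double_prime_k = int(delta.sum())                # predicted vehicle usage n''_k of area k
print(n_double_prime_k)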
step 3.3: obtaining the prediction result of the prediction network model, namely the predicted vehicle usage n_k of the vehicle-using area k in the future time period, through the weighted formula n_k = 0.4·n'_k + 0.6·n''_k applied to the regression prediction n'_k and the SVM prediction n''_k;
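The weighted fusion of step 3.3 as stated above, n_k = 0.4·n'_k + 0.6·n''_k:

def fuse_predictions(n_prime_k: float, n_double_prime_k: float) -> float:
    # combine the regression and SVM predictions for vehicle-using area k
    return 0.4 * n_prime_k + 0.6 * n_double_prime_k

print(fuse_predictions(18.0, 25))   # -> 22.2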
step 4: repeating step 2 and step 3 until the estimated vehicle usage of every vehicle-using area has been calculated;
step 5: defining the action instruction set a = {a_1, …, a_t, …, a_m}, where a_t represents the action information of the unmanned dispatching transport vehicle at time t and a_t = {η_t, κ_t}; η_t is the driving-direction information of the vehicle at time t, comprising driving up, driving down, driving left and driving right, and κ_t indicates whether the dispatching transport vehicle picks up or puts down a single vehicle at time t; defining the state instruction set s = {s_0, …, s_t, …, s_m}, where s_t represents the operating-environment state information of the unmanned dispatching transport vehicle at time t and s_t = {ρ_t, ι_t, μ_t}; ρ_t is the number of vehicles in each vehicle-using area at time t, i.e. ρ_t = (n_t1, …, n_tk, …, n_tX), where n_tk is the predicted vehicle usage of area k at time t and X is the total number of vehicle-using areas; ι_t is the position information of the dispatching transport vehicle at time t, comprising the coordinates on the horizontal and vertical axes; and μ_t is the position information of each user at time t, namely the area in which each user is currently located and whether the user is using a vehicle;
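A minimal sketch of the action a_t = (η_t, κ_t) and state s_t = (ρ_t, ι_t, μ_t) records defined in step 5; the concrete Python types are assumptions made for illustration:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    eta: str    # η_t: driving direction ("up", "down", "left" or "right")
    kappa: str  # κ_t: "drop" to put a single vehicle down, "pick" to take one away

@dataclass
class State:
    rho: List[float]            # ρ_t = (n_t1, ..., n_tk, ..., n_tX): predicted usage per area
    iota: Tuple[int, int]       # ι_t: truck position (horizontal and vertical coordinates)
    mu: List[Tuple[int, bool]]  # μ_t: each user's current area and whether the user is riding

s0 = State(rho=[12.0, 30.0, 8.0, 15.0, 22.0, 18.0], iota=(0, 0), mu=[(2, False), (5, True)])
a0 = Action(eta="right", kappa="drop")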
step 6: setting the reward function R with equation (1); when the unmanned dispatching transport vehicle operates in the network, the operating environment gives the corresponding reward according to the reward function, taking into account the action taken by the unmanned dispatching transport vehicle and the change of the environment state:
R = R_pr + R_a + R_n (1)
In equation (1), R_pr is the reward function of the prediction network model, given by equation (2), in which ζ denotes a reward-penalty coefficient and ζ ∈ (0, 1); R_a is the fixed action reward of the unmanned dispatching transport vehicle, given by equation (3), in which e is a constant; R_n is the single-vehicle dispatching reward function, given by equation (4), in which Δ indicates whether the unmanned dispatching transport vehicle is parked in the designated vehicle parking area (Δ = 1 means parked in the designated area, Δ = 0 means not parked in the designated area), h_k denotes the number of existing vehicles in the vehicle-using area k, c_k denotes the number of single vehicles put down or taken away by the unmanned dispatching transport vehicle in the vehicle-using area k, b is a constant, and r denotes another reward-penalty coefficient with r ∈ (0, 1);
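Equation (1) simply sums the three reward components; since the bodies of equations (2)-(4) are not reproduced above, they are taken here as already-computed inputs rather than re-derived:

def total_reward(r_pr: float, r_a: float, r_n: float) -> float:
    # R = R_pr + R_a + R_n: prediction-network reward, fixed action reward, dispatching reward
    return r_pr + r_a + r_n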
step 7: setting the learning rate α, the reward attenuation coefficient γ, the maximum number of iterations A and the update frequency T, and initializing t = 1;
step 8: constructing a prediction scheduling model based on a deep Q network:
step 8.1: constructing a prediction evaluation network model comprising an input layer with 13 neurons, a hidden layer of m_1 layers each containing m_2 neurons, an FC layer and an output layer with 7 neurons, and initializing the network parameters of the prediction evaluation network model to θ_0 by Gaussian initialization; during training, the time-sequence data corresponding to the environment state information are first converted into a tensor, the resulting tensor is then fed into the model for training, and the model outputs the action information of the unmanned dispatching transport vehicle;
step 8.2: constructing a prediction target network model with the same structure as the prediction evaluation network model, and initializing the network parameters of the prediction target network model to θ* by Gaussian initialization;
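A minimal sketch of the networks of steps 8.1 and 8.2: a 13-neuron input, a stack of hidden layers, an FC layer and a 7-neuron output, with Gaussian (normal) initialization of the weights; PyTorch, the depth of 2 hidden layers and the width of 64 neurons are assumptions standing in for m_1 and m_2:

import torch
import torch.nn as nn

class EvaluationNet(nn.Module):
    def __init__(self, state_dim=13, action_dim=7, hidden_layers=2, hidden_units=64):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(hidden_layers):                                # m_1 hidden layers (placeholder: 2)
            layers += [nn.Linear(in_dim, hidden_units), nn.ReLU()]   # m_2 neurons each (placeholder: 64)
            in_dim = hidden_units
        layers += [nn.Linear(in_dim, hidden_units), nn.ReLU()]       # FC layer before the output
        layers += [nn.Linear(hidden_units, action_dim)]              # one Q value per action
        self.net = nn.Sequential(*layers)
        self.apply(self._gaussian_init)

    @staticmethod
    def _gaussian_init(module):
        if isinstance(module, nn.Linear):                            # Gaussian initialization of the parameters
            nn.init.normal_(module.weight, mean=0.0, std=0.01)
            nn.init.zeros_(module.bias)

    def forward(self, state):
        return self.net(state)

eval_net = EvaluationNet()                  # prediction evaluation network, parameters θ_0 (step 8.1)
target_net = EvaluationNet()                # prediction target network, same structure (step 8.2)
q_values = eval_net(torch.zeros(1, 13))     # state tensor in, 7 action values out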
step 9: optimizing the parameters of the prediction evaluation network model:
step 9.1: calculating the auxiliary adjustment coefficient σ of the prediction evaluation network model with equation (5):
In equation (5), β denotes an error adjustment coefficient, and max_k(h_k + c_k − n_k) denotes the maximum error of the predicted vehicle usage over all vehicle-using areas;
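The max-error term used in equation (5) can be computed directly from the description above as max_k(h_k + c_k − n_k); how σ combines this term with β is given by equation (5) itself, which is not reproduced here, so only the max-error term is sketched:

def max_prediction_error(h, c, n):
    # h, c, n: per-area lists of existing vehicles h_k, vehicles moved c_k and predicted usage n_k
    return max(hk + ck - nk for hk, ck, nk in zip(h, c, n))

print(max_prediction_error(h=[25, 8, 14], c=[-5, 4, 0], n=[18, 10, 11]))   # -> 3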
step 9.2: the cost function Ψ is calculated using equation (6):
In equation (6), m denotes the total number of states in the operating-environment model of the unmanned dispatching transport vehicle, Q(s_t, a_t) denotes the true cumulative return at time t, Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t, and θ_t denotes the network parameters at time t;
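A sketch of the cost function Ψ of equation (6) under the assumption that it is the mean squared gap, over the m states, between the true cumulative return Q(s_t, a_t) and the estimate Q(s_t, a_t; θ_t); the equation body is not reproduced above, so this concrete form is an assumption:

import torch

def cost_psi(q_true: torch.Tensor, q_estimate: torch.Tensor) -> torch.Tensor:
    # q_true, q_estimate: shape (m,) tensors holding Q(s_t, a_t) and Q(s_t, a_t; θ_t) for the m states
    return torch.mean((q_true - q_estimate) ** 2)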
step 9.3: updating the network parameters of the prediction evaluation network model by using equation (7):
In equation (7), θ_t denotes the network parameters of the prediction evaluation network model at time t, θ_t* denotes the network parameters of the prediction target network model at time t, R_t denotes the value of the reward function at time t, Q(s_{t+1}, a_{t+1}; θ_t*) denotes the estimate of the true cumulative return given by the prediction target network model at time t, and Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t;
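A sketch of the step 9.3 update in the usual deep-Q-learning form: the temporal-difference target R_t + γ·Q(s_{t+1}, a_{t+1}; θ_t*) is compared with Q(s_t, a_t; θ_t) and the evaluation parameters are adjusted with learning rate α. Equation (7) itself is not reproduced above, so this is an assumed concrete form rather than the patented formula:

import torch

def td_update(eval_net, target_net, optimizer, batch, gamma):
    # batch: (states, actions, rewards, next_states, next_actions) tensors; actions are int64 indices
    states, actions, rewards, next_states, next_actions = batch
    q_sa = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; θ_t)
    with torch.no_grad():
        q_next = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)  # Q(s_{t+1}, a_{t+1}; θ_t*)
        target = rewards + gamma * q_next                                    # R_t + γ · Q(s_{t+1}, a_{t+1}; θ_t*)
    loss = torch.mean((target - q_sa) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                         # parameters move by the learning rate α
    return loss.item()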
step 10: according to the update frequency T, updating the network parameters θ* of the prediction target network model at each update moment with the parameters θ of the prediction evaluation network model at the corresponding moment;
step 11: assigning t + 1 to t and judging whether t > A holds, where A is the set threshold; if so, the optimal prediction target network model has been obtained; otherwise, returning to step 9.2 and continuing in sequence. The average-reward curve of the obtained optimal prediction target network model over 100 iterative tests is shown in figure 2: the horizontal axis represents the number of iterations, the vertical axis represents the cumulative reward obtained in the corresponding training period, and the average reward finally stabilizes at about −100;
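A sketch of the outer loop of steps 9-11, reusing the td_update sketch shown after step 9.3: every T steps the evaluation parameters θ are copied into the target network (step 10), and training stops once the step counter t exceeds the threshold A (step 11); environment interaction and experience-replay sampling are abstracted into the sample_batch callback:

def train(eval_net, target_net, optimizer, sample_batch, gamma, T, A):
    t = 1
    while t <= A:                                                            # step 11: stop once t > A
        td_update(eval_net, target_net, optimizer, sample_batch(), gamma)    # steps 9.2-9.3
        if t % T == 0:                                                       # step 10: sync every T steps
            target_net.load_state_dict(eval_net.state_dict())
        t += 1                                                               # step 11: assign t + 1 to t
    return target_net   # the optimal prediction target network model used in step 12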
step 12: realizing real-time scheduling of the number of single vehicles in each vehicle-using area by using the optimal prediction target network model.