Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a shared bicycle prediction and scheduling method based on a deep Q network, so that the vehicle demand of each vehicle-using area in a future time period can be predicted in advance, the scheduling of shared bicycles becomes more reasonable and efficient, and the advantages of sharing are fully exploited.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention relates to a shared bicycle prediction scheduling method based on a deep Q network, which is characterized by comprising the following steps:
step 1: establishing a vehicle area model and an unmanned dispatching transport vehicle operation environment model;
step 2: collecting days of all users in the vehicle area kUsual behavior data and the number h of single vehicles in the vehicle area k
together with the historical vehicle-using times t_k of the shared vehicle-using area k and the corresponding numbers of vehicles used; the j-th daily behavior record generated by user i in the vehicle-using area k is denoted d_ij and comprises a one-hot encoded vector converted from the weather information of the j-th record generated by user i, the time information of the j-th record generated by user i, the riding road information of the j-th record generated by user i (including a starting point, an end point and a route selection), and the vehicle-using information of the j-th record generated by user i; the user behavior data matrix is thereby obtained as D = (d_ij)_{M×N};
step 3: training a prediction network model consisting of a linear regression model and an SVM model:
step 3.1: constructing a linear regression model, taking the historical vehicle-using times t_k of the vehicle-using area k and the corresponding numbers of vehicles used as input variables, and optimizing the hyper-parameters of the linear regression model until it converges, thereby obtaining a prediction model that predicts the number of vehicles n'_k needed in the vehicle-using area k in the future time period;
step 3.2: constructing an SVM model and training it with the user behavior data matrix D as the input variable, so as to obtain a trained SVM model that outputs a classification result representing the vehicle-using demand of user i; if the classification result is 1, user i needs to use a vehicle, otherwise user i does not, and the predicted vehicle usage n''_k of the vehicle-using area k in the future time period is calculated from the classification results;
step 3.3: obtaining the prediction result of the prediction network model, namely the predicted vehicle usage n_k of the vehicle-using area k in the future time period, by a weighted combination of the regression prediction n'_k and the SVM prediction n''_k for the future time period;
step 4: repeating step 2 and step 3 until the estimated vehicle usage of every vehicle-using area has been calculated;
step 5: defining the action instruction set a = {a_1, …, a_t, …, a_m}, where a_t represents the action information of the unmanned dispatching transport vehicle at time t and a_t = {η_t, κ_t}, η_t denoting the driving-direction information of the vehicle at time t and κ_t denoting whether the dispatching transport vehicle picks up or puts down a single vehicle at time t; defining the state instruction set s = {s_0, …, s_t, …, s_m}, where s_t represents the operating-environment state information of the unmanned dispatching transport vehicle at time t and s_t = {ρ_t, ι_t, μ_t}, ρ_t denoting the number of vehicles in each vehicle-using area at time t, i.e. ρ_t = (n_t1, …, n_tk, …, n_tX), where n_tk is the predicted vehicle usage of area k at time t and X is the total number of vehicle-using areas, ι_t denoting the position information of the dispatching transport vehicle at time t, and μ_t denoting the position information of each user at time t;
step 6: setting the reward function R with equation (1):
R = R_pr + R_a + R_n (1)
In equation (1), R_pr is the reward function of the prediction network model, given by equation (2), in which ζ denotes a reward-penalty coefficient and ζ ∈ (0, 1); R_a is the fixed action reward of the unmanned dispatching transport vehicle, given by equation (3), in which e is a constant; R_n is the single-vehicle dispatching reward function, given by equation (4), in which Δ indicates whether the unmanned dispatching transport vehicle is parked in the designated vehicle parking area (Δ = 1 means parked in the designated area, Δ = 0 means not parked in the designated area), h_k denotes the number of existing vehicles in the vehicle-using area k, c_k denotes the number of single vehicles put down or taken away by the unmanned dispatching transport vehicle in the vehicle-using area k, b is a constant, and r denotes another reward-penalty coefficient with r ∈ (0, 1);
step 7: setting the learning rate α, the reward attenuation coefficient γ and the update frequency T, and initializing t = 1;
step 8: constructing a prediction scheduling model based on a deep Q network:
step 8.1: constructing a prediction evaluation network model comprising an input layer, a hidden layer with m_1 layers, an FC layer and an output layer, and initializing the network parameters of the prediction evaluation network model to θ_0 by Gaussian initialization;
step 8.2: constructing a prediction target network model with the same structure as the prediction evaluation network model, and initializing the network parameters of the prediction target network model to θ* by Gaussian initialization;
step 9: optimizing the parameters of the prediction evaluation network model:
step 9.1: calculating the auxiliary adjustment coefficient σ of the prediction evaluation network model with equation (5):
In equation (5), β denotes an error adjustment coefficient, and max_k(h_k + c_k − n_k) denotes the maximum error of the predicted vehicle usage over all vehicle-using areas;
step 9.2: the cost function Ψ is calculated using equation (6):
In equation (6), m denotes the total number of states in the operating-environment model of the unmanned dispatching transport vehicle, Q(s_t, a_t) denotes the true cumulative return at time t, Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t, and θ_t denotes the network parameters at time t;
step 9.3: updating the network parameters of the prediction evaluation network model by using the formula (7):
In equation (7), θ_t denotes the network parameters of the prediction evaluation network model at time t, θ_t* denotes the network parameters of the prediction target network model at time t, R_t denotes the value of the reward function at time t, Q(s_{t+1}, a_{t+1}; θ_t*) denotes the estimate of the true cumulative return given by the prediction target network model at time t, and Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t;
step 10: according to the update frequency T, updating the network parameters θ* of the prediction target network model at each update moment with the parameters θ of the prediction evaluation network model at the corresponding moment;
step 11: assigning t + 1 to t and judging whether t > A holds, where A is a set threshold; if so, the optimal prediction target network model has been obtained; otherwise, returning to step 9.2 and continuing in sequence;
step 12: and utilizing the optimal prediction target network model to realize real-time scheduling of the number of the single vehicles in each vehicle area.
Compared with the prior art, the invention has the beneficial effects that:
1. the shared bicycle prediction scheduling method based on the deep Q network overcomes the lag of traditional scheduling algorithms by combining prediction with scheduling, thereby greatly improving the utilization rate of shared bicycles;
2. the prediction network model can sense the vehicle using requirements of the user in advance by combining the advantages of the linear regression model and the SVM model, so that the vehicle can be scheduled in place before the user actually uses the vehicle, and the waiting time of the user is reduced;
3. by combining reinforcement learning with the prediction model, the hyper-parameters can be optimized through experience-replay learning even when training data are insufficient, which greatly reduces the training cost, improves the efficiency of the model, greatly improves the scheduling efficiency and timeliness of shared bicycles, and reduces the scheduling cost of shared bicycles.
Detailed Description
In this embodiment, as shown in fig. 3, a shared bicycle prediction scheduling method based on a deep Q network predicts the vehicle demand of each vehicle area in a future time period in advance by combining a linear regression model, an SVM model and a deep reinforcement learning method of the deep Q network under the condition of lack of sufficient training data, and specifically includes the following steps:
step 1: configuring a simulation environment with the Tkinter tool in the Python GUI library, and establishing a vehicle-using area model and an unmanned dispatching transport vehicle operating-environment model. The vehicle-using area model is composed as follows: the urban environment is simulated with a 5 × 5 grid; A-F respectively represent six different types of vehicle-using areas, namely a school, a park, a stadium, a pedestrian street, an office building and a subway station; a parking area and a parking upper limit are defined in each vehicle-using area, the parking areas being shown in grey and distinguished by numbers, so as to simulate the differences in vehicle demand between different types of areas; the total number of single vehicles is assumed to be 100, of which 20, 10, 20 and 20 are respectively distributed to the six areas at initialization, and the maximum vehicle capacities of the areas are 30, 20, 15, 25, 30 and 50. The unmanned dispatching transport vehicle operating-environment model comprises: the position of the unmanned dispatching transport vehicle, represented by a short solid black line; the actions of the unmanned dispatching transport vehicle, comprising driving up, driving down, driving left, driving right, putting down a single vehicle and taking away a single vehicle, with 10 single vehicles allocated to the unmanned dispatching transport vehicle at initialization; dotted lines simulating urban roads; blank areas representing other areas through which the unmanned dispatching transport vehicle is forbidden to pass; and the solid black lines on the periphery representing the boundary, which the dispatching vehicle cannot cross, as shown in figure 1;
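As a minimal sketch of the grid operating-environment model described in step 1 (written in plain Python rather than Tkinter, with all class and attribute names chosen for illustration only):

from dataclasses import dataclass, field

@dataclass
class VehicleArea:
    name: str       # one of the six area types A-F (school, park, ...)
    capacity: int   # maximum number of single vehicles the area may hold
    bikes: int = 0  # single vehicles currently parked in the area

@dataclass
class GridEnvironment:
    size: int = 5                              # 5 x 5 grid of cells
    areas: dict = field(default_factory=dict)  # (row, col) -> VehicleArea
    truck_pos: tuple = (0, 0)                  # unmanned dispatching transport vehicle position
    truck_load: int = 10                       # single vehicles carried by the truck at start

    def in_bounds(self, pos):
        r, c = pos
        return 0 <= r < self.size and 0 <= c < self.size

    def move_truck(self, direction):
        # drive up/down/left/right without crossing the boundary
        dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[direction]
        new_pos = (self.truck_pos[0] + dr, self.truck_pos[1] + dc)
        if self.in_bounds(new_pos):
            self.truck_pos = new_pos

    def drop_bike(self):
        # put one single vehicle down if the current cell is an area that is not full
        area = self.areas.get(self.truck_pos)
        if area and self.truck_load > 0 and area.bikes < area.capacity:
            area.bikes += 1
            self.truck_load -= 1

    def pick_bike(self):
        # take one single vehicle away from the area the truck currently stands on
        area = self.areas.get(self.truck_pos)
        if area and area.bikes > 0:
            area.bikes -= 1
            self.truck_load += 1

env = GridEnvironment(areas={(0, 0): VehicleArea("school", capacity=30, bikes=20)})
env.move_truck("right")
env.pick_bike()   # no area at (0, 1), so nothing happens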
step 2: collecting the daily behavior data of all users in the vehicle-using area k and the number h_k of single vehicles in the vehicle-using area k, together with the historical vehicle-using times t_k of the shared vehicle-using area k and the corresponding numbers of vehicles used; the j-th daily behavior record generated by user i in the vehicle-using area k is denoted d_ij and comprises a one-hot encoded vector converted from the weather information of the j-th record generated by user i, the time information of the j-th record generated by user i, the riding road information of the j-th record generated by user i (including a starting point, an end point and a route selection), and the vehicle-using information of the j-th record generated by user i; the user behavior data matrix is thereby obtained as D = (d_ij)_{M×N};
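A minimal sketch of how one daily behavior record d_ij could be turned into a numeric row of the user behavior data matrix D; the raw-record fields and the weather vocabulary are assumptions, since the text above only states that the weather information is one-hot encoded and that each record carries time, riding-road (start point, end point, route) and vehicle-using information:

import numpy as np

WEATHER_STATES = ["sunny", "cloudy", "rainy", "snowy"]   # assumed weather vocabulary

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_record(record):
    # turn the j-th raw record of user i into a numeric feature vector d_ij
    weather = one_hot(record["weather"], WEATHER_STATES)
    time_info = np.array([record["hour"]], dtype=float)                       # assumed time encoding
    road_info = np.array([record["start"], record["end"], record["route"]], dtype=float)
    used_bike = np.array([1.0 if record["used_bike"] else 0.0])               # vehicle-using information
    return np.concatenate([weather, time_info, road_info, used_bike])

# one row per (user i, record j) pair; M users with N records each give the M x N layout
records = [
    {"weather": "sunny", "hour": 8, "start": 3, "end": 12, "route": 1, "used_bike": True},
    {"weather": "rainy", "hour": 18, "start": 12, "end": 3, "route": 2, "used_bike": False},
]
D = np.stack([encode_record(r) for r in records])
print(D.shape)   # (2, 9)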
step 3: training a prediction network model consisting of a linear regression model and an SVM model:
step 3.1: constructing a linear regression model f(t) = at² + bt + c, where a, b and c are the three hyper-parameters to be adjusted during training; the historical vehicle-using times t_k of the vehicle-using area k and the corresponding numbers of vehicles used f(t_k) are taken as input variables and the hyper-parameters of the linear regression model are optimized until the model converges, thereby obtaining a prediction model that predicts the number of vehicles n'_k needed in the vehicle-using area k in the future time period;
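A minimal sketch of step 3.1, fitting f(t) = at² + bt + c by least squares to the historical vehicle-using times t_k of area k and the corresponding usage counts f(t_k); the example data and the future hour t = 20 are illustrative only:

import numpy as np

t_k = np.array([7, 8, 9, 12, 17, 18, 19], dtype=float)      # historical vehicle-using hours of area k
f_tk = np.array([12, 30, 22, 10, 25, 35, 20], dtype=float)   # vehicles used at those hours

a, b, c = np.polyfit(t_k, f_tk, deg=2)   # least-squares fit of the quadratic f(t) = a*t^2 + b*t + c

def predict(t):
    return a * t ** 2 + b * t + c

n_prime_k = max(0.0, predict(20))   # regression prediction n'_k for an assumed future hour t = 20
print(round(n_prime_k, 1))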
step 3.2: constructing an SVM model δ_i = sign(ω*·d_ij + d*), where ω* and d* are the hyper-parameters to be adjusted; the SVM model is trained with the user behavior data matrix D as the input variable to obtain a trained SVM model that outputs a classification result representing the vehicle-using demand of user i; if the classification result is 1, user i needs to use a vehicle, otherwise user i does not, and the predicted vehicle usage n''_k of the vehicle-using area k in the future time period is calculated from the classification results;
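A minimal sketch of step 3.2, using scikit-learn's linear SVM in place of the formula δ_i = sign(ω*·d_ij + d*); the feature matrix and labels are synthetic and the record dimensionality is assumed:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 9))                      # user behavior data matrix (one row per record)
labels = (D[:, 0] + D[:, 3] > 0).astype(int)       # 1 = the user needed a vehicle (toy labels)

svm = LinearSVC()                                  # learns the separating hyperplane (ω*, d*)
svm.fit(D, labels)

future_records = rng.normal(size=(30, 9))          # records attributed to area k for the future period
delta = svm.predict(future_records)                # δ_i = 1 means user i needs a vehicle
n_double_prime_k = int(delta.sum())                # predicted vehicle usage n''_k of area k
print(n_double_prime_k)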
step 3.3: obtaining the prediction result of the prediction network model, namely the predicted vehicle usage n_k of the vehicle-using area k in the future time period, through the weighted formula n_k = 0.4·n'_k + 0.6·n''_k applied to the regression prediction n'_k and the SVM prediction n''_k;
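The weighted fusion of step 3.3 as stated above, n_k = 0.4·n'_k + 0.6·n''_k:

def fuse_predictions(n_prime_k: float, n_double_prime_k: float) -> float:
    # combine the regression and SVM predictions for vehicle-using area k
    return 0.4 * n_prime_k + 0.6 * n_double_prime_k

print(fuse_predictions(18.0, 25))   # -> 22.2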
step 4: repeating step 2 and step 3 until the estimated vehicle usage of every vehicle-using area has been calculated;
step 5: defining the action instruction set a = {a_1, …, a_t, …, a_m}, where a_t represents the action information of the unmanned dispatching transport vehicle at time t and a_t = {η_t, κ_t}; η_t is the driving-direction information of the vehicle at time t, comprising driving up, driving down, driving left and driving right, and κ_t indicates whether the dispatching transport vehicle picks up or puts down a single vehicle at time t; defining the state instruction set s = {s_0, …, s_t, …, s_m}, where s_t represents the operating-environment state information of the unmanned dispatching transport vehicle at time t and s_t = {ρ_t, ι_t, μ_t}; ρ_t is the number of vehicles in each vehicle-using area at time t, i.e. ρ_t = (n_t1, …, n_tk, …, n_tX), where n_tk is the predicted vehicle usage of area k at time t and X is the total number of vehicle-using areas; ι_t is the position information of the dispatching transport vehicle at time t, comprising the coordinates on the horizontal and vertical axes; and μ_t is the position information of each user at time t, namely the area in which each user is currently located and whether the user is using a vehicle;
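A minimal sketch of the action a_t = (η_t, κ_t) and state s_t = (ρ_t, ι_t, μ_t) records defined in step 5; the concrete Python types are assumptions made for illustration:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    eta: str    # η_t: driving direction ("up", "down", "left" or "right")
    kappa: str  # κ_t: "drop" to put a single vehicle down, "pick" to take one away

@dataclass
class State:
    rho: List[float]            # ρ_t = (n_t1, ..., n_tk, ..., n_tX): predicted usage per area
    iota: Tuple[int, int]       # ι_t: truck position (horizontal and vertical coordinates)
    mu: List[Tuple[int, bool]]  # μ_t: each user's current area and whether the user is riding

s0 = State(rho=[12.0, 30.0, 8.0, 15.0, 22.0, 18.0], iota=(0, 0), mu=[(2, False), (5, True)])
a0 = Action(eta="right", kappa="drop")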
step 6: setting the reward function R with equation (1); when the unmanned dispatching transport vehicle operates in the network, the operating environment gives the corresponding reward according to the reward function, taking into account the action taken by the unmanned dispatching transport vehicle and the change of the environment state:
R = R_pr + R_a + R_n (1)
In equation (1), R_pr is the reward function of the prediction network model, given by equation (2), in which ζ denotes a reward-penalty coefficient and ζ ∈ (0, 1); R_a is the fixed action reward of the unmanned dispatching transport vehicle, given by equation (3), in which e is a constant; R_n is the single-vehicle dispatching reward function, given by equation (4), in which Δ indicates whether the unmanned dispatching transport vehicle is parked in the designated vehicle parking area (Δ = 1 means parked in the designated area, Δ = 0 means not parked in the designated area), h_k denotes the number of existing vehicles in the vehicle-using area k, c_k denotes the number of single vehicles put down or taken away by the unmanned dispatching transport vehicle in the vehicle-using area k, b is a constant, and r denotes another reward-penalty coefficient with r ∈ (0, 1);
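Equation (1) simply sums the three reward components; since the bodies of equations (2)-(4) are not reproduced above, they are taken here as already-computed inputs rather than re-derived:

def total_reward(r_pr: float, r_a: float, r_n: float) -> float:
    # R = R_pr + R_a + R_n: prediction-network reward, fixed action reward, dispatching reward
    return r_pr + r_a + r_n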
step 7: setting the learning rate α, the reward attenuation coefficient γ, the maximum number of iterations A and the update frequency T, and initializing t = 1;
step 8: constructing a prediction scheduling model based on a deep Q network:
step 8.1: constructing a prediction evaluation network model comprising an input layer with 13 neurons, a hidden layer of m_1 layers each containing m_2 neurons, an FC layer and an output layer with 7 neurons, and initializing the network parameters of the prediction evaluation network model to θ_0 by Gaussian initialization; during training, the time-sequence data corresponding to the environment state information are first converted into a tensor, the resulting tensor is then fed into the model for training, and the model outputs the action information of the unmanned dispatching transport vehicle;
step 8.2: constructing a prediction target network model with the same structure as the prediction evaluation network model, and initializing the network parameters of the prediction target network model to θ* by Gaussian initialization;
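A minimal sketch of the networks of steps 8.1 and 8.2: a 13-neuron input, a stack of hidden layers, an FC layer and a 7-neuron output, with Gaussian (normal) initialization of the weights; PyTorch, the depth of 2 hidden layers and the width of 64 neurons are assumptions standing in for m_1 and m_2:

import torch
import torch.nn as nn

class EvaluationNet(nn.Module):
    def __init__(self, state_dim=13, action_dim=7, hidden_layers=2, hidden_units=64):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(hidden_layers):                                # m_1 hidden layers (placeholder: 2)
            layers += [nn.Linear(in_dim, hidden_units), nn.ReLU()]   # m_2 neurons each (placeholder: 64)
            in_dim = hidden_units
        layers += [nn.Linear(in_dim, hidden_units), nn.ReLU()]       # FC layer before the output
        layers += [nn.Linear(hidden_units, action_dim)]              # one Q value per action
        self.net = nn.Sequential(*layers)
        self.apply(self._gaussian_init)

    @staticmethod
    def _gaussian_init(module):
        if isinstance(module, nn.Linear):                            # Gaussian initialization of the parameters
            nn.init.normal_(module.weight, mean=0.0, std=0.01)
            nn.init.zeros_(module.bias)

    def forward(self, state):
        return self.net(state)

eval_net = EvaluationNet()                  # prediction evaluation network, parameters θ_0 (step 8.1)
target_net = EvaluationNet()                # prediction target network, same structure (step 8.2)
q_values = eval_net(torch.zeros(1, 13))     # state tensor in, 7 action values out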
step 9: optimizing the parameters of the prediction evaluation network model:
step 9.1: calculating the auxiliary adjustment coefficient σ of the prediction evaluation network model with equation (5):
In equation (5), β denotes an error adjustment coefficient, and max_k(h_k + c_k − n_k) denotes the maximum error of the predicted vehicle usage over all vehicle-using areas;
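The max-error term used in equation (5) can be computed directly from the description above as max_k(h_k + c_k − n_k); how σ combines this term with β is given by equation (5) itself, which is not reproduced here, so only the max-error term is sketched:

def max_prediction_error(h, c, n):
    # h, c, n: per-area lists of existing vehicles h_k, vehicles moved c_k and predicted usage n_k
    return max(hk + ck - nk for hk, ck, nk in zip(h, c, n))

print(max_prediction_error(h=[25, 8, 14], c=[-5, 4, 0], n=[18, 10, 11]))   # -> 3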
step 9.2: the cost function Ψ is calculated using equation (6):
In equation (6), m denotes the total number of states in the operating-environment model of the unmanned dispatching transport vehicle, Q(s_t, a_t) denotes the true cumulative return at time t, Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t, and θ_t denotes the network parameters at time t;
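A sketch of the cost function Ψ of equation (6) under the assumption that it is the mean squared gap, over the m states, between the true cumulative return Q(s_t, a_t) and the estimate Q(s_t, a_t; θ_t); the equation body is not reproduced above, so this concrete form is an assumption:

import torch

def cost_psi(q_true: torch.Tensor, q_estimate: torch.Tensor) -> torch.Tensor:
    # q_true, q_estimate: shape (m,) tensors holding Q(s_t, a_t) and Q(s_t, a_t; θ_t) for the m states
    return torch.mean((q_true - q_estimate) ** 2)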
step 9.3: updating the network parameters of the prediction evaluation network model by using equation (7):
In equation (7), θ_t denotes the network parameters of the prediction evaluation network model at time t, θ_t* denotes the network parameters of the prediction target network model at time t, R_t denotes the value of the reward function at time t, Q(s_{t+1}, a_{t+1}; θ_t*) denotes the estimate of the true cumulative return given by the prediction target network model at time t, and Q(s_t, a_t; θ_t) denotes the cumulative return estimated by the prediction evaluation network model at time t;
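A sketch of the step 9.3 update in the usual deep-Q-learning form: the temporal-difference target R_t + γ·Q(s_{t+1}, a_{t+1}; θ_t*) is compared with Q(s_t, a_t; θ_t) and the evaluation parameters are adjusted with learning rate α. Equation (7) itself is not reproduced above, so this is an assumed concrete form rather than the patented formula:

import torch

def td_update(eval_net, target_net, optimizer, batch, gamma):
    # batch: (states, actions, rewards, next_states, next_actions) tensors; actions are int64 indices
    states, actions, rewards, next_states, next_actions = batch
    q_sa = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; θ_t)
    with torch.no_grad():
        q_next = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)  # Q(s_{t+1}, a_{t+1}; θ_t*)
        target = rewards + gamma * q_next                                    # R_t + γ · Q(s_{t+1}, a_{t+1}; θ_t*)
    loss = torch.mean((target - q_sa) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                         # parameters move by the learning rate α
    return loss.item()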
step 10: according to the update frequency T, updating the network parameters θ* of the prediction target network model at each update moment with the parameters θ of the prediction evaluation network model at the corresponding moment;
step 11: assigning t + 1 to t and judging whether t > A holds, where A is the set threshold; if so, the optimal prediction target network model has been obtained; otherwise, returning to step 9.2 and continuing in sequence. The average-reward curve of the obtained optimal prediction target network model over 100 iterative tests is shown in figure 2: the horizontal axis represents the number of iterations, the vertical axis represents the cumulative reward obtained in the corresponding training period, and the average reward finally stabilizes at about −100;
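A sketch of the outer loop of steps 9-11, reusing the td_update sketch shown after step 9.3: every T steps the evaluation parameters θ are copied into the target network (step 10), and training stops once the step counter t exceeds the threshold A (step 11); environment interaction and experience-replay sampling are abstracted into the sample_batch callback:

def train(eval_net, target_net, optimizer, sample_batch, gamma, T, A):
    t = 1
    while t <= A:                                                            # step 11: stop once t > A
        td_update(eval_net, target_net, optimizer, sample_batch(), gamma)    # steps 9.2-9.3
        if t % T == 0:                                                       # step 10: sync every T steps
            target_net.load_state_dict(eval_net.state_dict())
        t += 1                                                               # step 11: assign t + 1 to t
    return target_net   # the optimal prediction target network model used in step 12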
step 12: realizing real-time scheduling of the number of single vehicles in each vehicle-using area by using the optimal prediction target network model.