CN113627993A - Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning - Google Patents

Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning Download PDF

Info

Publication number
CN113627993A
Authority
CN
China
Prior art keywords
action
network
discharging
time
electric vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110989593.9A
Other languages
Chinese (zh)
Inventor
Yao Hanlin (姚翰林)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University Qinhuangdao Branch
Original Assignee
Northeastern University Qinhuangdao Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University Qinhuangdao Branch filed Critical Northeastern University Qinhuangdao Branch
Priority to CN202110989593.9A priority Critical patent/CN113627993A/en
Publication of CN113627993A publication Critical patent/CN113627993A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 - Market modelling; Market analysis; Collecting market data
    • G06Q30/0206 - Price or cost determination based on market factors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 - Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 - Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Game Theory and Decision Science (AREA)
  • Electric Propulsion And Braking For Vehicles (AREA)
  • Charge And Discharge Circuits For Batteries Or The Like (AREA)

Abstract

The invention provides an intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning, and relates to the technical field of electric vehicle charging and discharging. The data-driven machine learning algorithm can solve complex optimization problems without any prior knowledge of the system: it dynamically learns from historical operating states through function iteration and derives an optimal charging and discharging plan from accumulated experience and reward analysis. From the user's perspective, an MDP with unknown transition probabilities is constructed to describe the electric vehicle charging and discharging scheduling problem, and the randomness of electricity prices and commuting behavior is considered so that the model describes a realistic scenario. The method determines the optimal decision for the real-time decision problem without any system model information. The electricity price is predicted iteratively with a single-step LSTM network, giving higher prediction accuracy than a traditional time-series prediction method (ARIMA).

Description

Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of electric vehicle charging and discharging, in particular to an intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning.
Background
With the improvement of residents' living standards and the increasing intelligence of electric vehicles, more factors, such as the cumulative charging and discharging cost and user satisfaction, can be considered when charging and discharging an electric vehicle. For many households with an electric vehicle, always charging at maximum power while parked gives high user satisfaction but a high charging cost; always charging when the electricity price is low reduces the cost, but user satisfaction may drop if the electric vehicle is not fully charged when it departs.
Unlike other controllable loads and energy-storage equipment, an electric vehicle must first satisfy the travel demands of its users, so any regulation must respect the users' travel plans and their willingness to charge or discharge. Under realistic operating conditions, with uncertain wind and photovoltaic output, randomly fluctuating user loads, and a flexible, changing distribution-network topology, traditional model-based optimization methods struggle to find a realistic optimal solution for the electric vehicle's charging and discharging energy.
Because the arrival and departure times of electric vehicles make their energy consumption and the electricity price dynamic and time-varying, it is challenging to manage electric vehicle (EV) charging and discharging efficiently so as to reduce cost. In recent years, many day-ahead scheduling methods have been proposed for this problem. Although they have achieved some success in day-ahead charge/discharge scheduling, they are not suitable for real-time scenarios. Real-time scheduling strategies that respond to dynamic charging demand and time-varying electricity prices have therefore attracted much attention recently. The scheduling problem can be expressed as a model-based control problem, but the actual modeling process commonly suffers from the following limitations: 1. the hourly fluctuating electricity price is flexible, and there is a considerable delay in transmitting the actual electricity price to the electric vehicle for control; 2. most load control methods require an accurate model of the electric vehicle, yet because internal structures differ, different types of electric vehicles need different models, many vehicle parameters must be known in advance, and modeling each type is complex and difficult. Therefore, a model-free intelligent charging and discharging decision method that is based on predicted electricity prices and applicable to different types of electric vehicles is of great significance for dealing with highly flexible electricity prices and the poor applicability of existing methods across vehicle types.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning. The data-driven machine learning algorithm can solve complex optimization problems without prior knowledge of the system: it dynamically learns from historical operating states through function iteration and derives an optimal charging and discharging plan from accumulated experience and reward analysis.
In order to solve the technical problems, the invention adopts the following technical scheme:
an intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning specifically comprises the following steps:
step 1: collecting the electricity price data of the past 24 hours;
step 2: using a single-step prediction LSTM network to perform iterative prediction on the electricity price of 24 hours in the future;
Step 2.1: the LSTM network is unrolled into a 23-layer neural network, and every layer uses the same weight parameters;
Step 2.2: let the input of the first layer be d_{t-22} = p_{t-22} - p_{t-23}, where p_{t-22} and p_{t-23} are the electricity prices at time steps t-22 and t-23, respectively; y_{t-22} denotes the output of the first layer; W and R are the weight matrices of the LSTM gate structure; and c_{t-22} denotes the cell state, which contains past electricity price information;
Step 2.3: y_{t-22} and c_{t-22}, which carry the past electricity price information, are passed on through the second layer up to the last layer, which predicts the electricity price one time step ahead;
step 2.4: repeating the steps 2.1-2.3 until the electricity price of the future 24 hours is predicted in an iterative mode;
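Purely as an illustration of how steps 2.1-2.4 could be realized, the sketch below rolls a single-step LSTM forecaster forward 24 times; the PriceLSTM class, its hidden size, and the price-differencing scheme are assumptions made for this example and are not prescribed by the method itself.

```python
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    """Single-step LSTM price forecaster (hypothetical layer sizes)."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 23, 1) -- the 23 hour-to-hour price differences of the past 24 hours
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])        # predicted next-step price difference

def forecast_next_24h(model: PriceLSTM, past_prices: list) -> list:
    """Roll the single-step forecaster forward 24 times (step 2.4)."""
    prices = list(past_prices)                 # at least the past 24 hourly prices
    predictions = []
    for _ in range(24):
        diffs = torch.tensor([prices[i] - prices[i - 1] for i in range(-23, 0)],
                             dtype=torch.float32).view(1, 23, 1)
        with torch.no_grad():
            next_diff = model(diffs).item()
        next_price = prices[-1] + next_diff
        predictions.append(next_price)
        prices.append(next_price)              # feed the prediction back in
    return predictions
```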
and step 3: a DQN reinforcement learning method is introduced; an agent for controlling the electric vehicle load is trained on a neural network, and by observing the predicted electricity price and the battery state of charge (SOC) in the current hour and the rewards received, the agent automatically learns the optimization process of the electric vehicle's charging and discharging decisions and obtains the optimal control decision;
Step 3.1: initialize the experience pool D, the estimated action-value network Q_θ with parameters θ, and the target action-value network Q_{θ⁻} with parameters θ⁻; sample the initial SOC and the arrival and departure times of the electric vehicle from truncated normal distributions;
Step 3.2: the electric vehicle load-control agent has 7 power options, and the action space is written as A = [6 kW, 4 kW, 2 kW, 0 kW, -2 kW, -4 kW, -6 kW]; with probability ε the electric vehicle greedily selects the action a_t = argmax_a Q(s_t, a; θ), and with probability (1 - ε) it selects an action a_t at random, where s_t is the environment state at time step t, a is an action selectable in state s_t, and θ denotes the parameters of the Q network;
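A minimal sketch of the action-selection rule in step 3.2, keeping the convention stated above that the greedy action is taken with probability ε and a random action with probability (1 - ε); the q_network callable and the tensor shapes are assumptions for illustration.

```python
import random
import torch

ACTIONS_KW = [6, 4, 2, 0, -2, -4, -6]   # action space A from step 3.2 (kW)

def select_action(q_network, state: torch.Tensor, epsilon: float) -> int:
    """Return an index into ACTIONS_KW: greedy with probability epsilon,
    random with probability (1 - epsilon), as stated in step 3.2."""
    if random.random() < epsilon:
        with torch.no_grad():
            q_values = q_network(state.unsqueeze(0))   # shape (1, 7)
        return int(q_values.argmax(dim=1).item())
    return random.randrange(len(ACTIONS_KW))
```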
Step 3.3: the state observation at time step t is s_t = (u_t, E_t, P_{t-23}, ..., P_t), where (P_{t-23}, ..., P_t) are the hourly electricity prices of the 24 hours up to time step t, E_t is the remaining energy in the electric vehicle's battery, and u_t indicates whether the EV is at home;
Step 3.4: the state transition is s_{t+1} = f(s_t, a_t); the transition of E_t is controlled by the action a_t at time step t and described by the deterministic battery model E_{t+1} = E_t + a_t; for u_t and P_t the transition is stochastic, because the arrival time, the departure time and the next hour's electricity price are unknown;
Step 3.5: the reward function is r_t = -(d * n * a_t * p)/10 - λ * ((1 - soc) * C)^2, where d is the fraction of time taken by a complete charge or discharge of the vehicle: during discharging d = soc/(rate/C) and during charging d = (1 - soc)/(rate/C), with rate the charging/discharging power; n is the time step length, taken here as one hour; p is the real-time electricity price; λ is the penalty coefficient of the penalty term; C is the battery capacity; the penalty term is applied if t + 1 is the last time step and the electric vehicle is not fully charged; soc is the ratio of the battery's remaining energy to the battery capacity;
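The reward of step 3.5 can be written as a small function; variable names mirror the symbols above, and the zero-power branch guarding against division by zero is an added assumption not spelled out in the text.

```python
def reward(action_kw: float, soc: float, price: float, capacity_kwh: float,
           lam: float, n: float = 1.0, last_step: bool = False) -> float:
    """Reward r_t from step 3.5: negative cost term plus an end-of-horizon penalty."""
    rate = abs(action_kw)                              # charging/discharging power
    if rate == 0:
        d = 0.0                                        # idle hour: assumed no charge/discharge time
    elif action_kw < 0:                                # discharging
        d = soc / (rate / capacity_kwh)
    else:                                              # charging
        d = (1.0 - soc) / (rate / capacity_kwh)
    r = -(d * n * action_kw * price) / 10.0
    if last_step and soc < 1.0:                        # EV leaves without a full battery
        r -= lam * ((1.0 - soc) * capacity_kwh) ** 2
    return r
```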
Step 3.6: store the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D;
Step 3.7: randomly sample a minibatch of quadruples (s_j, a_j, r_j, s_{j+1}) from the replay memory D, where #F is the number of tuples in the minibatch and j = 1, 2, ..., #F;
Step 3.8: using the target action-value network parameters θ⁻, calculate the target action value y_j, which is independent of the estimated-value network parameters θ, as y_j = r_j + γ * max_{a'} Q(s_{j+1}, a'; θ⁻), where θ⁻ are the parameters of the target network, γ is the discount coefficient with value range [0, 1], and Q is the action value;
Step 3.9: minimize the loss function L(θ) = (1/#F) * Σ_j (y_j - Q(s_j, a_j; θ))^2 and back-propagate by gradient descent to update the estimated-value network parameters θ;
Step 3.10: repeat steps 3.2-3.9, and every set number of steps copy the estimated action-value network parameters into the target action-value network parameters to update the target network;
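Steps 3.6-3.10 amount to a standard DQN update loop; the sketch below is one possible PyTorch rendering under the assumption of a generic Q network, a gradient-based optimizer, and a simple deque-based replay memory, not a verbatim reproduction of the patented implementation.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)                # experience pool D (step 3.6)

def dqn_update(q_net, target_net, optimizer, batch_size=32, gamma=0.99):
    """One update of the estimated-value network theta (steps 3.7-3.9)."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)                    # step 3.7: sample minibatch
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    with torch.no_grad():                                        # step 3.8: target value y_j
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)                                      # step 3.9: loss L(theta)
    optimizer.zero_grad()
    loss.backward()                                              # gradient descent on theta
    optimizer.step()

def sync_target(q_net, target_net):
    """Step 3.10: copy the estimated-network parameters into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```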
Step 3.11: repeat steps 3.1-3.10 until a policy π is learned that maximizes the cumulative reward R, where R = Σ_t γ^t * r_t;
Step 3.12: optimal selection among known DQNActing as atUnder the competitive structure, the action value function decomposes to: q (s, a) ═ v(s) + a (s, a) at this time, optimum action a*Argmax (a (s, a)). Where V(s) is the state cost function and A (s, a) is the action dominance function.
The invention has the following beneficial effects:
the invention provides an intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning, which has the following beneficial effects:
1. From the user's perspective, the invention constructs an MDP with unknown transition probabilities to describe the electric vehicle charging and discharging scheduling problem; the randomness of electricity prices and commuting behavior is considered so that the model describes a realistic scenario;
2. The method determines the optimal decision for the real-time decision problem without requiring any system model information;
3. Electricity prices are predicted iteratively with a single-step LSTM network, giving higher prediction accuracy than a traditional time-series prediction method (ARIMA);
4. A dueling (competition) structure is added at the DQN output, splitting the Q value into the sum of a state-value function and the advantage function of the specific action in the given state; this effectively alleviates the overestimation of the DQN value function, enhances the generalization ability of the model, and mitigates the noise and instability caused in the traditional DQN algorithm by large differences in the absolute value of the Q function across the action and state dimensions.
Drawings
FIG. 1 is a general flow chart of the intelligent electric vehicle charging and discharging decision method according to the present invention;
FIG. 2 is a block diagram of an LSTM network according to an embodiment of the present invention;
FIG. 3 is a graph of the training performance of the DQN and the dueling deep Q network (dueling-DQN) in the embodiment of the present invention;
FIG. 4 is a graph of the cumulative charging and discharging costs of the DQN and the dueling-DQN in the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
An intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning is shown in fig. 1 and specifically comprises the following steps:
step 1: collecting the electricity price data of the past 24 hours;
step 2: using a single-step prediction LSTM network to perform iterative prediction on the electricity price of 24 hours in the future, as shown in FIG. 2;
Step 2.1: the LSTM network is unrolled into a 23-layer neural network, and every layer uses the same weight parameters;
Step 2.2: let the input of the first layer be d_{t-22} = p_{t-22} - p_{t-23}, where p_{t-22} and p_{t-23} are the electricity prices at time steps t-22 and t-23, respectively; y_{t-22} denotes the output of the first layer; W and R are the weight matrices of the LSTM gate structure; and c_{t-22} denotes the cell state, which contains past electricity price information;
Step 2.3: y_{t-22} and c_{t-22}, which carry the past electricity price information, are passed on through the second layer up to the last layer, which predicts the electricity price one time step ahead;
step 2.4: repeating the steps 2.1-2.3 until the electricity price of the future 24 hours is predicted in an iterative mode;
and step 3: a DQN reinforcement learning method is introduced; an agent for controlling the electric vehicle load is trained on a neural network, and by observing the predicted electricity price and the battery state of charge (SOC) in the current hour and the rewards received, the agent automatically learns the optimization process of the electric vehicle's charging and discharging decisions and obtains the optimal control decision;
Step 3.1: initialize the experience pool D, the estimated action-value network Q_θ with parameters θ, and the target action-value network Q_{θ⁻} with parameters θ⁻; sample the initial SOC and the arrival and departure times of the electric vehicle from truncated normal distributions;
Step 3.2: the electric vehicle load-control agent has 7 power options, and the action space is written as A = [6 kW, 4 kW, 2 kW, 0 kW, -2 kW, -4 kW, -6 kW]; with probability ε the electric vehicle greedily selects the action a_t = argmax_a Q(s_t, a; θ), and with probability (1 - ε) it selects an action a_t at random, where s_t is the environment state at time step t, a is an action selectable in state s_t, and θ denotes the parameters of the Q network;
Step 3.3: the state observation at time step t is s_t = (u_t, E_t, P_{t-23}, ..., P_t), where (P_{t-23}, ..., P_t) are the hourly electricity prices of the 24 hours up to time step t, E_t is the remaining energy in the electric vehicle's battery, and u_t indicates whether the EV is at home;
Step 3.4: the state transition is s_{t+1} = f(s_t, a_t); the transition of E_t is controlled by the action a_t at time step t and described by the deterministic battery model E_{t+1} = E_t + a_t; for u_t and P_t the transition is stochastic, because the arrival time, the departure time and the next hour's electricity price are unknown;
Step 3.5: the reward function is r_t = -(d * n * a_t * p)/10 - λ * ((1 - soc) * C)^2, where d is the fraction of time taken by a complete charge or discharge of the vehicle: during discharging d = soc/(rate/C) and during charging d = (1 - soc)/(rate/C), with rate the charging/discharging power; n is the time step length, taken here as one hour; p is the real-time electricity price; λ is the penalty coefficient of the penalty term; C is the battery capacity; the penalty term is applied if t + 1 is the last time step and the electric vehicle is not fully charged; soc is the ratio of the battery's remaining energy to the battery capacity;
Step 3.6: store the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D;
Step 3.7: randomly sample a minibatch of quadruples (s_j, a_j, r_j, s_{j+1}) from the replay memory D, where #F is the number of tuples in the minibatch and j = 1, 2, ..., #F;
Step 3.8: using the target action-value network parameters θ⁻, calculate the target action value y_j, which is independent of the estimated-value network parameters θ, as y_j = r_j + γ * max_{a'} Q(s_{j+1}, a'; θ⁻), where θ⁻ are the parameters of the target network; γ is the discount coefficient, with value range [0, 1], which discounts rewards over time to achieve better performance: if γ equals 0 only the current reward is considered, and if γ equals 1 the environment is treated as deterministic, so the same action always receives the same reward; Q is the action value;
Step 3.9: minimize the loss function L(θ) = (1/#F) * Σ_j (y_j - Q(s_j, a_j; θ))^2 and back-propagate by gradient descent to update the estimated-value network parameters θ;
Step 3.10: repeat steps 3.2-3.9, and every set number of steps copy the estimated action-value network parameters into the target action-value network parameters to update the target network;
Step 3.11: repeat steps 3.1-3.10 until a policy π is learned that maximizes the cumulative reward R, where R = Σ_t γ^t * r_t;
Step 3.12: the best action among the known DQNs is selected as atUnder the competitive structure, the action value function decomposes to: q (s, a) ═ v(s) + a (s, a) at this time, optimum action a*Argmax (a (s, a)). Where V(s) is the state cost function and A (s, a) is the action dominance function.
The method is divided into two stages: in the first stage the LSTM network forecasts electricity prices, and in the second stage the DQN method trains the agent to obtain the optimal strategy.
In the present invention, the electricity price trend is captured by the LSTM network. Its input is the electricity prices of the past 24 time steps and its output is the electricity price of the next time step. The idea behind the LSTM network is to exploit sequential information such as real-time electricity prices: the network applies the same processing to every element of the sequence, its output depends on the previous computations, and the information computed so far can be stored in the LSTM cell. For this EV charging scheduling problem, the LSTM network is unrolled into a 23-layer neural network. Specifically, the input to the first layer is d_{t-22} = p_{t-22} - p_{t-23}, where p_{t-22} and p_{t-23} are the electricity prices at time steps t-22 and t-23, respectively; W and R denote the weight parameters shared between all layers; y_{t-22} is the output of the first layer and c_{t-22} its cell state. y_{t-22} and c_{t-22}, which carry the past electricity price information, are passed to the second layer, and the process repeats until the last layer. As the unrolled view shows, the output of each layer is passed on to the next unit. Every layer of the unrolled LSTM network uses the same weight parameters, which greatly simplifies training of the network parameters.
The output y_t of the LSTM network is concatenated with the scalar battery SOC. These concatenated features contain information about the predicted future electricity prices and the battery SOC. Information about future electricity prices is important for reducing the charging cost, while information about the battery SOC is important for ensuring that the EV is fully charged. The concatenated features are then fed into the dueling Q network to obtain the action advantage function A(s, a_i) for each action, and the optimal action a* = argmax_a A(s, a) is chosen, yielding the optimal charging and discharging plan.
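As a small sketch of the concatenation just described, the 24 predicted hourly prices from the LSTM stage can be joined with the scalar battery SOC to form the Q-network input; the ordering and dimensionality here are illustrative assumptions only.

```python
import torch

def build_q_input(predicted_prices: list, soc: float) -> torch.Tensor:
    """Concatenate the LSTM price forecast with the scalar battery SOC (illustrative)."""
    price_feat = torch.tensor(predicted_prices, dtype=torch.float32)  # 24 predicted prices
    soc_feat = torch.tensor([soc], dtype=torch.float32)               # battery SOC
    return torch.cat([price_feat, soc_feat])                          # shape (25,)
```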
In reinforcement learning it is necessary to estimate the value of each state, but for many states it is not necessary to estimate the value of every action. The dueling (competition) network structure therefore evaluates the state value and the advantage of each action in that state separately. The state-action value function Q_π(s, a) is the expected return of selecting action a in state s under policy π; the state value V_π(s) is the value of state s, i.e. the expected value over all the actions that policy π produces in that state; and the difference between the two is the advantage of selecting action a in state s, defined as
A_π(s, a) = Q_π(s, a) - V_π(s)
The dueling network thus has two data streams, one outputting the state value V(s; θ, β) and the other the action advantage A(s, a; θ, α), where θ denotes the parameters of the layers that process the input features (the weights of each layer of the neural network), and α and β are the parameters of the two streams, respectively. The output of a deep Q network with the dueling structure is
Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)
In practice, when the dueling network structure is applied, the mean of the action advantages usually replaces their maximum in the computation of the Q value, which preserves performance while improving optimization stability. As can be seen from FIG. 3, the dueling-DQN shows smaller overall loss fluctuations and converges faster than the DQN.
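The mean-based aggregation mentioned above can be written in one line; assuming value and advantage tensors of shapes (batch, 1) and (batch, num_actions), this is the commonly used mean-subtracted dueling combination.

```python
import torch

def dueling_q(value: torch.Tensor, advantage: torch.Tensor) -> torch.Tensor:
    """Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)): the mean replaces the max for stability."""
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```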
As can be seen from FIG. 4, over 100 randomly selected days the charging and discharging costs of the dueling-DQN method are generally lower than those of the DQN method.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (3)

1. An intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning is characterized by comprising the following steps:
step 1: collecting the electricity price data of the past 24 hours;
step 2: using a single-step prediction LSTM network to perform iterative prediction on the electricity price of 24 hours in the future;
and step 3: a DQN reinforcement learning method is introduced; an agent for controlling the electric vehicle load is trained on a neural network, and by observing the predicted electricity price and the battery state of charge (SOC) in the current hour and the rewards received, the agent automatically learns the optimization process of the electric vehicle's charging and discharging decisions and obtains the optimal control decision.
2. The intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning of claim 1, wherein the step 2 specifically comprises the following steps:
Step 2.1: the LSTM network is unrolled into a 23-layer neural network, and every layer uses the same weight parameters;
Step 2.2: let the input of the first layer be d_{t-22} = p_{t-22} - p_{t-23}, where p_{t-22} and p_{t-23} are the electricity prices at time steps t-22 and t-23, respectively; y_{t-22} denotes the output of the first layer; W and R are the weight matrices of the LSTM gate structure; and c_{t-22} denotes the cell state, which contains past electricity price information;
Step 2.3: y_{t-22} and c_{t-22}, which carry the past electricity price information, are passed on through the second layer up to the last layer, which predicts the electricity price one time step ahead;
step 2.4: and repeating the steps 2.1-2.3 until the future 24-hour electricity price is predicted in an iterative mode.
3. The intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning of claim 1, wherein the step 3 specifically comprises the following steps:
Step 3.1: initialize the experience pool D, the estimated action-value network Q_θ with parameters θ, and the target action-value network Q_{θ⁻} with parameters θ⁻; sample the initial SOC and the arrival and departure times of the electric vehicle from truncated normal distributions;
Step 3.2: the electric vehicle load-control agent has 7 power options, and the action space is written as A = [6 kW, 4 kW, 2 kW, 0 kW, -2 kW, -4 kW, -6 kW]; with probability ε the electric vehicle greedily selects the action a_t = argmax_a Q(s_t, a; θ), and with probability (1 - ε) it selects an action a_t at random, where s_t is the environment state at time step t, a is an action selectable in state s_t, and θ denotes the parameters of the Q network;
Step 3.3: the state observation at time step t is s_t = (u_t, E_t, P_{t-23}, ..., P_t), where (P_{t-23}, ..., P_t) are the hourly electricity prices of the 24 hours up to time step t, E_t is the remaining energy in the electric vehicle's battery, and u_t indicates whether the EV is at home;
Step 3.4: the state transition is s_{t+1} = f(s_t, a_t); the transition of E_t is controlled by the action a_t at time step t and described by the deterministic battery model E_{t+1} = E_t + a_t; for u_t and P_t the transition is stochastic, because the arrival time, the departure time and the next hour's electricity price are unknown;
Step 3.5: the reward function is r_t = -(d * n * a_t * p)/10 - λ * ((1 - soc) * C)^2, where d is the fraction of time taken by a complete charge or discharge of the vehicle: during discharging d = soc/(rate/C) and during charging d = (1 - soc)/(rate/C), with rate the charging/discharging power; n is the time step length, taken here as one hour; p is the real-time electricity price; λ is the penalty coefficient of the penalty term; C is the battery capacity; the penalty term is applied if t + 1 is the last time step and the electric vehicle is not fully charged; soc is the ratio of the battery's remaining energy to the battery capacity;
Step 3.6: store the tuple (s_t, a_t, r_t, s_{t+1}) in the experience pool D;
Step 3.7: randomly sample a minibatch of quadruples (s_j, a_j, r_j, s_{j+1}) from the replay memory D, where #F is the number of tuples in the minibatch and j = 1, 2, ..., #F;
Step 3.8: using the target action-value network parameters θ⁻, calculate the target action value y_j, which is independent of the estimated-value network parameters θ, as y_j = r_j + γ * max_{a'} Q(s_{j+1}, a'; θ⁻), where θ⁻ are the parameters of the target network, γ is the discount coefficient with value range [0, 1], and Q is the action value;
Step 3.9: minimize the loss function L(θ) = (1/#F) * Σ_j (y_j - Q(s_j, a_j; θ))^2 and back-propagate by gradient descent to update the estimated-value network parameters θ;
Step 3.10: repeat steps 3.2-3.9, and every set number of steps copy the estimated action-value network parameters into the target action-value network parameters to update the target network;
Step 3.11: repeat steps 3.1-3.10 until a policy π is learned that maximizes the cumulative reward R, where R = Σ_t γ^t * r_t;
Step 3.12: the best action among the known DQNs is selected as atUnder the competition structure, the action value function decomposes to: q (s, a) ═ v(s) + a (s, a) at this timeOptimum action a*Argmax (a (s, a)), where v(s) is the state cost function and a (s, a) is the action dominance function.
CN202110989593.9A 2021-08-26 2021-08-26 Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning Pending CN113627993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110989593.9A CN113627993A (en) 2021-08-26 2021-08-26 Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110989593.9A CN113627993A (en) 2021-08-26 2021-08-26 Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN113627993A true CN113627993A (en) 2021-11-09

Family

ID=78387939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110989593.9A Pending CN113627993A (en) 2021-08-26 2021-08-26 Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113627993A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139653A (en) * 2021-12-15 2022-03-04 中国人民解放军国防科技大学 Intelligent agent strategy obtaining method based on adversary action prediction and related device
CN114254765A (en) * 2022-03-01 2022-03-29 之江实验室 Active sequence decision method, device and medium for simulation deduction
CN114844083A (en) * 2022-05-27 2022-08-02 深圳先进技术研究院 Electric vehicle cluster charging and discharging management method for improving stability of energy storage system
CN114997935A (en) * 2022-07-19 2022-09-02 东南大学溧阳研究院 Electric vehicle charging and discharging strategy optimization method based on interior point strategy optimization
CN115293100A (en) * 2022-09-30 2022-11-04 深圳市威特利电源有限公司 Accurate evaluation method for residual electric quantity of new energy battery
CN117863948A (en) * 2024-01-17 2024-04-12 广东工业大学 Distributed electric vehicle charging control method and device for auxiliary frequency modulation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276638A (en) * 2019-05-29 2019-09-24 南京邮电大学 A kind of Electricity price forecasting solution and system based on two-way shot and long term neural network
CN110535157A (en) * 2018-05-24 2019-12-03 三菱电机(中国)有限公司 The discharge control device and discharge control method of electric car
CN110588432A (en) * 2019-08-27 2019-12-20 深圳市航通北斗信息技术有限公司 Electric vehicle, battery management method thereof and computer-readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110535157A (en) * 2018-05-24 2019-12-03 三菱电机(中国)有限公司 The discharge control device and discharge control method of electric car
CN110276638A (en) * 2019-05-29 2019-09-24 南京邮电大学 A kind of Electricity price forecasting solution and system based on two-way shot and long term neural network
CN110588432A (en) * 2019-08-27 2019-12-20 深圳市航通北斗信息技术有限公司 Electric vehicle, battery management method thereof and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHIQIANG WAN et al.: "Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning", IEEE Transactions on Smart Grid, vol. 10, no. 5, pages 5246-5257, XP011741247, DOI: 10.1109/TSG.2018.2879572 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114139653A (en) * 2021-12-15 2022-03-04 中国人民解放军国防科技大学 Intelligent agent strategy obtaining method based on adversary action prediction and related device
CN114254765A (en) * 2022-03-01 2022-03-29 之江实验室 Active sequence decision method, device and medium for simulation deduction
CN114844083A (en) * 2022-05-27 2022-08-02 深圳先进技术研究院 Electric vehicle cluster charging and discharging management method for improving stability of energy storage system
CN114844083B (en) * 2022-05-27 2023-02-17 深圳先进技术研究院 Electric automobile cluster charging and discharging management method for improving stability of energy storage system
CN114997935A (en) * 2022-07-19 2022-09-02 东南大学溧阳研究院 Electric vehicle charging and discharging strategy optimization method based on interior point strategy optimization
CN114997935B (en) * 2022-07-19 2023-04-07 东南大学溧阳研究院 Electric vehicle charging and discharging strategy optimization method based on interior point strategy optimization
CN115293100A (en) * 2022-09-30 2022-11-04 深圳市威特利电源有限公司 Accurate evaluation method for residual electric quantity of new energy battery
CN117863948A (en) * 2024-01-17 2024-04-12 广东工业大学 Distributed electric vehicle charging control method and device for auxiliary frequency modulation

Similar Documents

Publication Publication Date Title
CN113627993A (en) Intelligent electric vehicle charging and discharging decision method based on deep reinforcement learning
CN109347149B (en) Micro-grid energy storage scheduling method and device based on deep Q-value network reinforcement learning
CN110341690B (en) PHEV energy management method based on deterministic strategy gradient learning
CN112117760A (en) Micro-grid energy scheduling method based on double-Q-value network deep reinforcement learning
CN111934335A (en) Cluster electric vehicle charging behavior optimization method based on deep reinforcement learning
CN110659796B (en) Data acquisition method in rechargeable group vehicle intelligence
CN113515884A (en) Distributed electric vehicle real-time optimization scheduling method, system, terminal and medium
CN112614009A (en) Power grid energy management method and system based on deep expected Q-learning
CN113572157B (en) User real-time autonomous energy management optimization method based on near-end policy optimization
CN114997631B (en) Electric vehicle charging scheduling method, device, equipment and medium
CN114997935B (en) Electric vehicle charging and discharging strategy optimization method based on interior point strategy optimization
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN111833205B (en) Intelligent scheduling method for mobile charging pile group under big data scene
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN115587645A (en) Electric vehicle charging management method and system considering charging behavior randomness
CN113110052A (en) Hybrid energy management method based on neural network and reinforcement learning
CN114619907B (en) Coordinated charging method and coordinated charging system based on distributed deep reinforcement learning
CN116683513A (en) Method and system for optimizing energy supplement strategy of mobile micro-grid
CN116436019B (en) Multi-resource coordination optimization method, device and storage medium
CN117543581A (en) Virtual power plant optimal scheduling method considering electric automobile demand response and application thereof
CN114611811B (en) Low-carbon park optimal scheduling method and system based on EV load participation
CN113555888B (en) Micro-grid energy storage coordination control method
CN114742453A (en) Micro-grid energy management method based on Rainbow deep Q network
CN114154718A (en) Day-ahead optimization scheduling method of wind storage combined system based on energy storage technical characteristics
CN114030386A (en) Electric vehicle charging control method based on user charging selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination