CN115663793B - Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning - Google Patents

Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Info

Publication number
CN115663793B
CN115663793B
Authority
CN
China
Prior art keywords
representing
energy storage
power
action
vehicle
Prior art date
Legal status
Active
Application number
CN202211225223.9A
Other languages
Chinese (zh)
Other versions
CN115663793A (en)
Inventor
陈实
郭正伟
朱亚斌
杨林森
刘艺洪
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202211225223.9A
Publication of CN115663793A
Application granted
Publication of CN115663793B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/60: Other road transportation technologies with climate change mitigation effect
    • Y02T 10/70: Energy storage systems for electromobility, e.g. batteries

Landscapes

  • Charge And Discharge Circuits For Batteries Or The Like (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles. Taking carbon emission minimization as the objective, the method coordinates the electric vehicles, the system energy storage devices and the compensation devices and, under constraints that prevent system overload caused by the random access of electric vehicles, solves the model with the deep reinforcement learning algorithm TD3 to obtain an optimized charge-discharge control strategy for the electric vehicles and the system energy storage facilities. By controlling the charging and discharging of the electric vehicles while still meeting their charging demand, system operation can be optimized, the renewable energy in the system can be effectively absorbed, and the carbon emissions of system operation can be reduced.

Description

Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of power dispatching, and in particular to an electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning.
Background
According to relevant statistical analyses, power systems account for about 15% of global carbon emissions through the combustion of fossil fuels, so reducing the carbon emissions of the power system provides a strong impetus for green, sustainable development. The fundamental means of achieving carbon emission reduction in the power system is energy substitution, that is, replacing fossil energy such as coal with large-scale renewable energy for electricity generation so as to effectively reduce carbon emissions.
However, renewable energy represented by wind and solar power is random, which degrades the operating performance of the system, so it needs to be paired with an energy storage system. Deploying energy storage can raise the grid-connection rate of new energy while optimizing the power flow distribution of the system and supporting frequency and peak regulation, thereby greatly improving system operating performance. Meanwhile, as the penetration of electric vehicles rises rapidly, their influence on grid operation can no longer be ignored. Electric vehicles also have characteristics such as relatively large per-unit capacity, fast response speed, and a certain regularity in the times at which they connect to the system, giving them great potential for flexible scheduling. They can therefore serve as schedulable energy storage for the system and be used to smooth the negative influence that the randomness and fluctuation of renewable energy such as wind and photovoltaic power exert on system operation, thereby optimizing system operation, improving the utilization of wind and photovoltaic power, and further reducing the carbon emissions of the system.
However, how to optimize system operation by scheduling the charging and discharging of electric vehicles, while still meeting their charging demand, so as to effectively absorb the renewable energy in the system and reduce the carbon emissions of system operation, remains a problem to be solved.
Disclosure of Invention
To address the above problem in the prior art, the invention provides a low-carbon charge-discharge scheduling method for electric vehicles based on deep reinforcement learning. Taking carbon emission minimization as the objective, the method coordinates the electric vehicles, the system energy storage devices and the compensation devices and, under constraints that prevent system overload caused by the random access of electric vehicles (EVs), solves the model with the deep reinforcement learning algorithm TD3 to obtain an optimized charge-discharge control strategy for the electric vehicles and the system energy storage facilities. By controlling the charging and discharging of the electric vehicles while still meeting their charging demand, system operation can be optimized, the renewable energy in the system can be effectively absorbed, and the carbon emissions of system operation can be reduced.
The invention provides a deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles, which comprises the following steps:
s1: every time a decision period passes, a state corresponding to the current time t is obtainedState information s t ={g t ,e t ,b t ,v t ,t}, g t Representing the basic information of the power system at the current moment t; e, e t Information representing an electric vehicle; b t Information representative of the energy storage device; v t Information representative of reactive compensation equipment;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
According to one possible embodiment, with carbon emission minimization as the objective, the objective function is configured as:

min Σ_t [ c_t + Σ_n ( m_n + w_n + h_n ) ]

c_t = P_G F_G ρ = P_G b_G

where m_n is the action (adjustment) cost of each device; w_n is the running cost of each device; h_n is a penalty function used to punish improper actions of the agent; c_t is the carbon emission of the thermal power unit operating at a given load; P_G is the unit power; F_G is the coal consumption rate of the unit; ρ is the carbon conversion coefficient; and b_G is the marginal emission factor of electricity.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, e_t = {e_is, e_soc, e_tar, e_r, e_cap, e_n, e_ch, e_dis, η}, where e_is indicates whether a vehicle is connected; e_soc represents the current SoC level of the vehicle; e_tar represents the target amount of energy the vehicle should at least reach by the end of charging; e_r represents the remaining charging time; e_cap represents the vehicle battery capacity; e_n represents the position of the charging pile; e_ch represents the vehicle charging power; e_dis represents the vehicle discharging power; and η represents the vehicle charge/discharge efficiency.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, b_t = {b_soc, b_up, b_low, b_p, b_n, η_b}, where b_soc represents the current SoC level of the energy storage device; b_up and b_low represent the upper and lower SoC limits allowed for the energy storage device, respectively; b_p represents the current operating power of the energy storage device; b_n represents the number of the energy storage device; and η_b represents the charge/discharge efficiency of the energy storage device;

the corresponding economic costs allocated to the energy storage device during operation are, respectively:

m_b = δ_b |ΔP_b|

w_b = α_b |P_b|

where m_b and w_b are the power adjustment cost and the operation cost of the energy storage device, respectively; δ_b and α_b are the unit-power adjustment cost and the unit-power operation cost of the energy storage device, respectively.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, v_t = {v_p, v_n}, where v_p represents the current operating power of the reactive compensation device and v_n represents the number of the reactive compensation device;

the corresponding economic costs allocated to the reactive compensation device for depreciation during operation are, respectively:

m_v = δ_v |ΔQ_v|

w_v = α_v |Q_v|

where m_v and w_v are the power adjustment cost and the operation cost of the reactive compensation device, respectively; δ_v and α_v are the unit-power adjustment cost and the unit-power operation cost of the reactive compensation device, respectively.
According to one possible implementation, when training the agent based on the TD3 algorithm, the objective function is further modified, based on the characteristics of the TD3 algorithm, to:

J_β(μ) = E_{s~ρ^β}[ Q^μ(s, μ(s)) ]

where β denotes the behaviour policy adopted by the agent when exploring the environment during training; ρ^β denotes the distribution of states visited under the behaviour policy; Q^μ(s, μ(s)) denotes the action value that can be generated in state s when actions are selected according to policy μ; and J_β(μ) denotes the expected return that can be obtained when actions are selected according to policy μ.
According to one possible implementation, when training the agent based on the TD3 algorithm, the algorithm alternates between policy evaluation and policy improvement; wherein,

in the policy evaluation phase, the state-action value Q^μ(s, μ(s)) needs to be calculated, and the Q function can be expressed by the Bellman equation:

Q^μ(s_t, a_t) = E[ r_t + γ Q^μ(s_{t+1}, a_{t+1}) ]

After the Q function is parameterized with a neural network, it is approximated by minimizing the Bellman residual:

L(θ^Q) = E[ ( Q(s_t, a_t | θ^Q) − ( r_t + γ Q'(s_{t+1}, a_{t+1} | θ^{Q'}) ) )² ]

where θ^Q and θ^{Q'} denote the parameters of the Q network and the Target Q network, respectively;

in the policy improvement phase, after the Q function has been parameterized by a neural network, the gradient of the objective function J_β(μ) is used to update the network parameters, i.e.:

∇_{θ^μ} J_β(μ) ≈ E[ ∇_a Q(s, a | θ^Q) |_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ) ]
according to one possible implementation manner, in the training process of the intelligent agent based on the TD3 algorithm, the Target value is further evaluated through two Target Q networks with different initial parameters, and the smaller value is selected as the Target value, so that the minimum Bellman residual error correction is required to be:
Figure BDA0003879394550000044
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003879394550000045
is a noisy motion; />
Figure BDA0003879394550000046
And->
Figure BDA0003879394550000047
Parameters for two different Target Q networks;
and smoothing regularization using the target strategy to enhance stability of the strategy and smooth the Q-function; i.e. in the next state s when calculating the Bellman residual t+1 Action a taken t+1 Will be chosen as:
Figure BDA0003879394550000051
wherein μ represents the Target policy network; epsilon is the added noise, typically selected to be gaussian noise, and its magnitude is sheared to a small extent.
According to one possible implementation, the agent is configured to output the parameters a_μ and a_std, which are taken as the input of the Q network, and the actual action a'_te is selected as:

ε ~ N(0, 1)

z = tanh( a_μ + a_std · ε )

a'_te is the discrete charge/discharge action determined by the interval into which z falls.
according to one possible implementation, the penalty function h n Comprising the following steps: penalty function h corresponding to electric automobile vec Penalty function h corresponding to energy storage device b The method comprises the steps of carrying out a first treatment on the surface of the Wherein, the liquid crystal display device comprises a liquid crystal display device,
h vec =||(z-a em )|| 2
h b =||(a tb -a bm )|| 2
wherein a is em An auxiliary vector for vehicle motion is equal to a vector obtained by performing motion shearing on z according to boundary conditions; a, a bm An auxiliary vector for the action of the energy storage battery is equal to a tb And performing action shearing according to the capacity boundary of the vector.
Description of the drawings:
FIG. 1 is a schematic illustration of the process of the present invention;
FIG. 2 is a schematic diagram of the relationship between the standard coal consumption rate of the thermal power generating unit and the unit load;
FIG. 3 is a probability diagram corresponding to the compressed Gaussian strategy adopted for the electric vehicles;
FIG. 4 is a schematic diagram of an architecture of a power distribution system in a simulation experiment;
FIG. 5 is a schematic diagram of simulation results with respect to carbon emissions and various costs;
FIG. 6 is a graph showing the comparison of the effects of training an agent by the TD3 algorithm and the DDPG algorithm;
FIG. 7 is a statistical plot of the number of penalties to an agent during training;
FIG. 8 is a statistical chart of the number of voltage limit violations occurring in the distribution network system during training.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and specific examples. The scope of the subject matter of the invention should not be construed as being limited to the following embodiments; all techniques realized on the basis of the content of the invention fall within the scope of the invention.
In one embodiment of the invention, as shown in fig. 1, the electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning of the invention comprises the following steps:
s1: acquiring state information s corresponding to the current time t every time a decision period passes t ={g t ,e t ,b t ,v t ,t}, g t Representing the basic information of the power system at the current moment t; e, e t Information representing the current time t of the electric automobile; b t Information representing the energy storage device at the current time t; v t Information representing the reactive compensation equipment at the current moment t;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
In this embodiment, a Markov decision process (MDP) describes a memoryless sequence of state transitions in which the next state depends only on the current state and the action currently being executed. For the joint scheduling problem studied by the invention, if the power supply, the load, and other quantities that are not controlled by the scheduling strategy are treated as approximately constant within each decision period, the joint scheduling model becomes a sequential model and can be further converted into an MDP model, so that the joint scheduling problem can be solved with a reinforcement learning algorithm.
Specifically, a typical Markov decision process can be represented by a five-tuple {S, A, R, P, γ}, where S denotes the state space, A the action space, R the return space, P the set of state transition probabilities, and γ the discount rate of the return. For the joint scheduling problem of the invention, which takes carbon emission minimization as the objective, the five-tuple of the corresponding MDP can be determined as follows:
(1) S: the State is the environmental information that the agent can perceive, and it is used as the input from which the agent generates a policy or action. The State at time t is defined as:

s_t = {g_t, e_t, b_t, v_t, t}

where g_t represents the basic information of the power system at the current time t; e_t represents the information of the electric vehicles at the current time t; b_t represents the information of the energy storage devices at the current time t; and v_t represents the information of the reactive compensation devices at the current time t.
(2) A: an Action is the policy (or action) that the agent outputs after perceiving the state s_t; it is fed into the environment to drive the state transition to the next step. The action output by the agent in state s_t is defined as:

a_t = {a_te, a_tb, a_tv}

where a_te represents the charge/discharge actions of each vehicle; a_tb represents the charge/discharge power of the energy storage devices; and a_tv represents the operating power of the reactive compensation devices.
(3) R: when an agent performs a policy (or action), a certain return can be obtained from the environment, and the return value will be used to evaluate the quality of the policy (or action) taken. In the present invention, reward is defined as:
r_t = p_t + c_t

where p_t represents the cost incurred from the outcomes caused by the scheduling strategy, generated by a series of auxiliary policies; and c_t represents the amount of carbon emissions generated during the scheduling process.
(4) P: Probability denotes the state transition probability. When the agent generates and executes a policy (or action), the environment controls the objects within it to interact according to that policy and transitions to the next state. In this process, the uncertainty of the environment itself causes the interaction process and the next state to vary, so a state transition probability matrix P is generally used to represent the probability of transitioning to each state after the policy is executed. In the model studied by the invention, the electric vehicles that connect to the system at the next moment and the output of the wind-photovoltaic generation system are uncertain, so even under the same strategy the state at the next moment is uncertain from the agent's point of view. The probability matrix P is generated implicitly by Monte Carlo sampling of the environment.
(5) γ: the discount factor γ is used to describe the importance of the returns that can be obtained in the future. Going through a complete Markov decision process produces a Markov chain, or equivalently a "trajectory", which yields the return:

R = Σ_{t=0}^{T} r_t

It should be pointed out that, on the one hand, future returns are difficult to estimate accurately because of the state transition probabilities and are therefore slightly less important than the current return; on the other hand, when rewards are sparse, the value of a state is closely related to the final result, so the later returns need to be multiplied by a discount factor, giving:

R = Σ_{t=0}^{T} γ^t r_t

where γ takes values in [0, 1]. The smaller γ is, the more the agent focuses on the return obtainable from the current strategy (or action), which appears short-sighted, but the training process converges relatively faster; the larger γ is, the more the agent emphasizes the returns the strategy (or action) can obtain in the future, which appears far-sighted, but the training process converges slowly and may suffer from saturation problems.

Generally, γ can take the value 0.9; alternatively, letting the average length of a trajectory be n, the recommended value of γ is:

[equation image: recommended value of γ as a function of the average trajectory length n]
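As an illustration of the five-tuple and the discounted return defined above, the short Python sketch below flattens the state and action components into vectors and accumulates R = Σ_t γ^t r_t over one trajectory; the array layout and dimensions are illustrative assumptions rather than the patent's exact data format.

```python
import numpy as np

def build_state(g_t, e_t, b_t, v_t, t):
    """Flatten the grid, EV, storage and SVC information into one state vector s_t."""
    return np.concatenate([np.atleast_1d(g_t), np.atleast_1d(e_t),
                           np.atleast_1d(b_t), np.atleast_1d(v_t), [t]])

def build_action(a_te, a_tb, a_tv):
    """Concatenate EV charge/discharge actions, storage power and SVC power into a_t."""
    return np.concatenate([np.atleast_1d(a_te), np.atleast_1d(a_tb), np.atleast_1d(a_tv)])

def discounted_return(rewards, gamma=0.9):
    """R = sum_t gamma^t * r_t accumulated over one trajectory (Markov chain)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A 24-step trajectory with unit rewards; with gamma = 0.9 the discounted
# return is (1 - 0.9**24) / (1 - 0.9) ≈ 9.20.
print(discounted_return([1.0] * 24))
```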
in the present embodiment, with the aim of minimizing the carbon emission amount, the objective function thereof is configured to:
Figure BDA0003879394550000091
wherein m is n The action cost of each device; w (w) n The running cost of each device; h is a n The system is a penalty function and is used for punishing the misoperation of the intelligent agent; c t Is the carbon emission quantity, P, of the thermal power unit in the running process under a certain load G The power of the unit is; f (F) G The coal consumption rate of the unit; ρ is a carbon conversion coefficient; b G Is the marginal discharge factor of the electric quantity.
Since the thermal power unit consumes fuel and produces a certain amount of carbon emissions during operation, and this is influenced by its operating power (in general, the unit consumes more fuel per unit of electricity when operating at low power), the relation between the coal consumption rate and the unit load of the thermal power unit shown in fig. 2 can be used to convert the carbon emission of the unit at a given load through the curve:

c_t = P_G F_G ρ = P_G b_G

where P_G is the unit power; F_G is the coal consumption rate of the unit; ρ is the carbon conversion coefficient; and b_G is the marginal emission factor of electricity.
The operating cost and the frequency regulation (adjustment) cost of the thermal power unit are:

m_G = δ_G ΔP_G

w_G = P_G F_G M_f

where m_G and w_G are the power adjustment cost and the operation cost of the thermal power unit, respectively; δ_G is the unit-power adjustment cost; and M_f is the fuel cost.
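For illustration, the sketch below evaluates c_t = P_G · F_G · ρ and the thermal unit's adjustment and fuel costs; the coal-consumption curve standing in for FIG. 2, the carbon conversion coefficient and the cost coefficients are placeholder assumptions, not values from the patent.

```python
import numpy as np

# Hypothetical (load fraction, coal consumption rate) pairs standing in for the
# curve of FIG. 2; real values would come from the unit's measured characteristics.
LOAD_POINTS = np.array([0.3, 0.5, 0.7, 0.9, 1.0])
COAL_RATE = np.array([0.36, 0.33, 0.31, 0.30, 0.30])   # F_G in t of standard coal per MWh

def coal_rate(p_g, p_rated):
    """Interpolate the coal consumption rate F_G at the current unit power P_G."""
    return float(np.interp(p_g / p_rated, LOAD_POINTS, COAL_RATE))

def carbon_emission(p_g, p_rated, rho=2.66):
    """c_t = P_G * F_G * rho; rho is a placeholder carbon conversion coefficient."""
    return p_g * coal_rate(p_g, p_rated) * rho

def thermal_costs(p_g, delta_p_g, p_rated, delta_g=5.0, m_f=100.0):
    """m_G = delta_G * dP_G (here taken in magnitude) and w_G = P_G * F_G * M_f."""
    return delta_g * abs(delta_p_g), p_g * coal_rate(p_g, p_rated) * m_f

print(carbon_emission(p_g=2.0, p_rated=4.0))              # tCO2 in this hour
print(thermal_costs(p_g=2.0, delta_p_g=0.5, p_rated=4.0))
```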
Specifically, in the state information s_t = {g_t, e_t, b_t, v_t, t}, e_t = {e_is, e_soc, e_tar, e_r, e_cap, e_n, e_ch, e_dis, η}, where e_is indicates whether a vehicle is connected; e_soc represents the current SoC level of the vehicle; e_tar represents the target amount of energy the vehicle should at least reach by the end of charging; e_r represents the remaining charging time; e_cap represents the vehicle battery capacity; e_n represents the position of the charging pile; e_ch represents the vehicle charging power; e_dis represents the vehicle discharging power; and η represents the vehicle charge/discharge efficiency. Specifically, the charging demand data are processed in combination with the work-rest pattern of present-day urban life: the electric vehicle charging demand is described by a bimodal normal distribution, and samples are generated by Monte Carlo sampling.
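A minimal sketch of the Monte Carlo sampling just mentioned, assuming an illustrative two-peak normal mixture for the plug-in times; the peak hours, standard deviation and mixture weight are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ev_arrivals(n_vehicles, peak1=8.0, peak2=19.0, std=1.5, w1=0.45):
    """Draw EV plug-in times (hours) from a two-peak normal mixture, wrapped to [0, 24)."""
    use_first_peak = rng.random(n_vehicles) < w1
    means = np.where(use_first_peak, peak1, peak2)
    return np.mod(rng.normal(loc=means, scale=std), 24.0)

# One Monte Carlo sample of 50 plug-in hours:
print(np.sort(sample_ev_arrivals(50)).round(1))
```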
The charging and discharging processes of the electric vehicle are described by the following equations:

[equation image: SoC update equations for vehicle charging and discharging, with the battery energy kept between e_lower and e_upper]

where e_upper and e_lower respectively denote the upper and lower limits of the battery energy that ensure a normal battery service life.
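Because the equation image is not reproduced in this text, the sketch below gives one common form of such an update that is consistent with the symbols defined above (charging scaled by η, discharging divided by η, energy clipped to [e_lower, e_upper]); it is an assumed stand-in, not the patent's exact formula.

```python
def ev_soc_update(e_soc, e_cap, p, dt=1.0, eta=0.95, e_lower=0.1, e_upper=0.9):
    """One possible SoC transition: p > 0 charging, p < 0 discharging.

    Charging stores eta * p * dt in the battery; discharging draws p * dt / eta
    from it; the stored energy is clipped to [e_lower, e_upper] * e_cap to
    protect the battery's service life.
    """
    if p >= 0:
        energy = e_soc * e_cap + eta * p * dt
    else:
        energy = e_soc * e_cap + p * dt / eta
    energy = min(max(energy, e_lower * e_cap), e_upper * e_cap)
    return energy / e_cap

# Charging a 0.06 MWh battery at 0.007 MW for one hour from 30% SoC:
print(round(ev_soc_update(0.30, 0.06, 0.007), 3))   # -> 0.411
```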
Furthermore, b_t = {b_soc, b_up, b_low, b_p, b_n, η_b}, where b_soc represents the current SoC level of the energy storage device; b_up and b_low represent the upper and lower SoC limits allowed for the energy storage device, respectively; b_p represents the current operating power of the energy storage device; b_n represents the number of the energy storage device; and η_b represents the charge/discharge efficiency of the energy storage device; the charging and discharging process of the energy storage device is described analogously.
the state information s t ={g t ,e t ,b t ,v t In t, v t ={v p ,v n }, where v p Representing the current operating power of the reactive compensation equipment v n Representing the reactive compensation equipment number.
Meanwhile, the economic costs allocated to the energy storage device during operation are, respectively:

m_b = δ_b |ΔP_b|

w_b = α_b |P_b|

where m_b and w_b are the power adjustment cost and the operation cost of the energy storage device, respectively; δ_b and α_b are the unit-power adjustment cost and the unit-power operation cost of the energy storage device, respectively.
The corresponding economic costs allocated to the reactive compensation device for depreciation during operation are, respectively:

m_v = δ_v |ΔQ_v|

w_v = α_v |Q_v|

where m_v and w_v are the power adjustment cost and the operation cost of the reactive compensation device, respectively; δ_v and α_v are the unit-power adjustment cost and the unit-power operation cost of the reactive compensation device, respectively.
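These cost terms are simple absolute-value expressions; the sketch below evaluates them, with the unit-cost coefficients δ and α taken as placeholder values.

```python
def storage_costs(p_b, delta_p_b, delta_b=2.0, alpha_b=1.0):
    """m_b = delta_b * |dP_b| (adjustment) and w_b = alpha_b * |P_b| (operation)."""
    return delta_b * abs(delta_p_b), alpha_b * abs(p_b)

def svc_costs(q_v, delta_q_v, delta_v=1.5, alpha_v=0.8):
    """m_v = delta_v * |dQ_v| (adjustment) and w_v = alpha_v * |Q_v| (operation)."""
    return delta_v * abs(delta_q_v), alpha_v * abs(q_v)

print(storage_costs(p_b=1.2, delta_p_b=-0.4))   # -> (0.8, 1.2)
print(svc_costs(q_v=0.6, delta_q_v=0.1))        # ≈ (0.15, 0.48)
```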
In this embodiment, when training the agent based on the TD3 algorithm, the objective function is further modified, based on the characteristics of the TD3 algorithm, to:

J_β(μ) = E_{s~ρ^β}[ Q^μ(s, μ(s)) ]

where β denotes the behaviour policy adopted by the agent when exploring the environment during training; ρ^β denotes the distribution of states visited under the behaviour policy; Q^μ(s, μ(s)) denotes the action value that can be generated in state s when actions are selected according to policy μ; and J_β(μ) denotes the expected return that can be obtained when actions are selected according to policy μ.
Moreover, in the process of training the agent with the TD3 algorithm, the algorithm alternates between policy evaluation and policy improvement; wherein,

in the policy evaluation phase, the state-action value Q^μ(s, μ(s)) needs to be calculated, and the Q function can be expressed by the Bellman equation:

Q^μ(s_t, a_t) = E[ r_t + γ Q^μ(s_{t+1}, a_{t+1}) ]

After the Q function is parameterized with a neural network, it is approximated by minimizing the Bellman residual:

L(θ^Q) = E[ ( Q(s_t, a_t | θ^Q) − ( r_t + γ Q'(s_{t+1}, a_{t+1} | θ^{Q'}) ) )² ]

where θ^Q and θ^{Q'} denote the parameters of the Q network and the Target Q network, respectively;

in the policy improvement phase, after the Q function has been parameterized by a neural network, the gradient of the objective function J_β(μ) is used to update the network parameters, i.e.:

∇_{θ^μ} J_β(μ) ≈ E[ ∇_a Q(s, a | θ^Q) |_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ) ]
in addition, in order to reduce the oscillation and error caused by updating the strategy when the Q function is not stable, a strategy delay updating technology is adopted, so that the frequency of occurrence of a strategy improvement stage is lower than that of occurrence of a strategy evaluation stage, namely, the strategy evaluation is carried out when an intelligent agent interacts with the environment each time, but the strategy improvement is carried out only once after a certain number of interactions.
In this embodiment, in order to solve the overestimation problem of the Q-value network and to smooth the Q function during training, a double Q-value network and target policy smoothing regularization are used. Specifically, the Target value is evaluated through two Target Q networks with different initial parameters, and the smaller of the two is selected as the Target value, so that the Bellman residual to be minimized is corrected to:

L(θ^{Q_i}) = E[ ( Q_i(s_t, a_t | θ^{Q_i}) − ( r_t + γ min_{j=1,2} Q'_j(s_{t+1}, ã_{t+1} | θ^{Q'_j}) ) )² ],  i = 1, 2

where ã_{t+1} is the noise-perturbed action, and θ^{Q'_1} and θ^{Q'_2} are the parameters of the two different Target Q networks;

target policy smoothing regularization is also used to enhance the stability of the policy and smooth the Q function; that is, when calculating the Bellman residual, the action ã_{t+1} taken in the next state s_{t+1} is chosen as:

ã_{t+1} = μ'(s_{t+1} | θ^{μ'}) + ε,  ε ~ clip( N(0, σ²), −c, c )

where μ' denotes the Target policy network; ε is the added noise, typically chosen as Gaussian noise, whose amplitude is clipped so as to be limited to a small range. By adding noise to the action, the Q value of the action used when calculating the Bellman residual converges towards the expected Q value of the actions in its neighbourhood, thereby smoothing the Q function. The smoothed Q function is then used to guide the update of the policy network, which effectively prevents excessively fast updates of the policy-network parameters caused by overly large gradients of the Q function and enhances the stability of the policy network.
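The following NumPy sketch makes the two mechanisms above concrete: the Target value is formed with clipped Gaussian noise on the target policy's action and with the minimum of the two Target Q networks; the network objects are hypothetical callables, and the delayed policy update is indicated only by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, mu_target, q1_target, q2_target,
               gamma=0.98, sigma=0.2, clip_c=0.5, a_low=-1.0, a_high=1.0):
    """y = r + gamma * min_i Q'_i(s_{t+1}, a~_{t+1}) with target policy smoothing.

    mu_target, q1_target and q2_target are hypothetical callables standing in
    for the Target policy network and the two Target Q networks.
    """
    a_target = np.asarray(mu_target(s_next))
    noise = np.clip(rng.normal(0.0, sigma, size=a_target.shape), -clip_c, clip_c)
    a_tilde = np.clip(a_target + noise, a_low, a_high)          # noisy target action
    return r + gamma * np.minimum(q1_target(s_next, a_tilde),   # clipped double-Q minimum
                                  q2_target(s_next, a_tilde))

# Both critics are regressed towards y at every interaction step, while the
# policy (Actor) and the Target networks are updated only once every several
# critic updates (delayed policy update); that outer loop is omitted here.

# Toy check with linear stand-ins for the Target networks:
mu_t = lambda s: 0.5 * s
q1_t = lambda s, a: s + a
q2_t = lambda s, a: s + 0.9 * a
print(td3_target(r=1.0, s_next=np.array([0.2]),
                 mu_target=mu_t, q1_target=q1_t, q2_target=q2_t))
```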
In the present embodiment, the action vector a_te output by the agent represents the charge/discharge action of each vehicle and should therefore be a discrete, binary-style output. However, the policy network of the TD3 algorithm performs gradient-based optimization on the Q function, which requires the Q function to be differentiable. A discrete output would introduce steps into the true Q function (since the Q network is only an approximation of the objectively existing true Q function, the result is not genuine non-differentiability but rather gradient explosion, which would make the network unstable). Moreover, the action space of TD3 is a continuous action space rather than a discrete one. Therefore, the agent is configured to output the parameters a_μ and a_std, which are taken as the input of the Q network, and the actual action a'_te is selected as:

ε ~ N(0, 1)

z = tanh( a_μ + a_std · ε )

a'_te is the discrete charge/discharge action determined by the interval into which z falls.

Through the above operations, the output of the policy network is mapped to a Gaussian distribution, and the selection probability of an actual action equals the integral of the probability density function over one of three intervals after the Gaussian distribution has been compression-transformed. As shown in fig. 3, the areas of the grey striped shading and the blue striped shading represent the probabilities of the vehicle performing the discharging and charging actions, respectively, and the shaded areas are controlled by the parameters a_μ and a_std. The Q function is therefore differentiable with respect to a_μ and a_std, and the corresponding Q value represents the expected benefit obtained when, given the parameters a_μ and a_std, the vehicle charges or discharges according to the compressed Gaussian strategy. Moreover, using this compressed Gaussian strategy makes the TD3 algorithm compatible with discrete action outputs, at the cost of requiring more samples. In addition, the charge and discharge power of the electric vehicle is set here in discrete gears; considering practical situations, if a vehicle supports continuous power adjustment, its charge/discharge power can be taken directly from the compressed Gaussian strategy.
In this embodiment, penalty functions are set for different objects. When the strategy output by the agent is inconsistent with the action space allowed by the vehicle network, the improper behaviour must be punished; for the energy storage device, since it is always connected to the distribution network, only operations that would drive its battery energy beyond its limits need to be punished. The penalty function h_n is therefore set to comprise: the penalty function h_vec corresponding to the electric vehicles and the penalty function h_b corresponding to the energy storage device; wherein,

h_vec = ||(z − a_em)||²

h_b = ||(a_tb − a_bm)||²

where a_em is an auxiliary vector for the vehicle actions, equal to the vector obtained by clipping z according to the boundary conditions; and a_bm is an auxiliary vector for the energy storage battery action, equal to the vector obtained by clipping a_tb according to the capacity boundary.
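A minimal sketch of the two penalty terms, implementing the clipping described above; the feasible ranges are placeholder assumptions.

```python
import numpy as np

def vehicle_penalty(z, lower, upper):
    """h_vec = ||z - a_em||^2, where a_em is z clipped to each vehicle's feasible range."""
    a_em = np.clip(z, lower, upper)
    return float(np.sum((z - a_em) ** 2))

def storage_penalty(a_tb, p_min, p_max):
    """h_b = ||a_tb - a_bm||^2, where a_bm is a_tb clipped to the storage capacity limits."""
    a_bm = np.clip(a_tb, p_min, p_max)
    return float(np.sum((a_tb - a_bm) ** 2))

# A vehicle that is not plugged in (feasible range collapsed to 0) is penalised
# for any non-zero command, while a feasible command incurs no penalty:
print(vehicle_penalty(np.array([0.8, -0.2]),
                      lower=np.array([0.0, -1.0]),
                      upper=np.array([0.0, 1.0])))   # ≈ 0.64
```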
In order to further verify the effectiveness of the deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles, simulation experiments were carried out. Specifically, an IEEE 33-node power distribution system was selected as the prototype for the simulation and partially adjusted. As shown in fig. 4, a wind turbine generator set and a photovoltaic generation system are installed at nodes 6 and 24, respectively; energy storage devices are installed at nodes 10 and 16; static var compensators are installed at nodes 18 and 23; and electric vehicle charging stations are installed at nodes 11 and 30.
The output data of the wind turbine generator set and the photovoltaic generation system are derived from the elia.be forecasts for Aggregate Belgian Wind Farms and the Belgium region over the period 01/06/2021-30/06/2021, multiplied by appropriate scaling factors to match the capacity of the distribution system. The marginal emission factor of electricity is taken as 0.8953 t/MWh with reference to the 2019 baseline emission factors of Chinese regional power grids for emission reduction projects; the carbon price is taken as 91.38 per tonne based on the average traded price on the European Climate Exchange during 21/02/2022-23/02/2022, converted at an exchange rate of 6.99.
In the experiments, the simulation step length is 1 h; the learning rates of the Actor and Critic networks and the update weight of the Target networks are set to 10^-5, 3.0 × 10^-5 and 10^-3, respectively; the discount factor γ is 0.98; the batch size is set to 128; and the replay buffer size is set to 10^5.
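For reference, the reported settings (together with the exploration-noise variance and the test interval mentioned in the following paragraphs) can be gathered into a single configuration dictionary; the key names are an organisational choice, not part of the patent.

```python
TD3_CONFIG = {
    "sim_step_hours": 1,            # simulation step length
    "actor_lr": 1e-5,               # Actor network learning rate
    "critic_lr": 3.0e-5,            # Critic network learning rate
    "target_update_weight": 1e-3,   # update weight of the Target networks
    "gamma": 0.98,                  # discount factor
    "batch_size": 128,
    "buffer_size": int(1e5),        # replay buffer capacity
    "exploration_noise_var": 0.09,  # N(0, 0.09) exploration noise (see below)
    "eval_every_rounds": 10,        # noise-free scheduling test every 10 training rounds
}
```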
The experiments were performed based on Python and the TensorFlow 2.0 framework, using a computer equipped with an Intel Core i5-9300H CPU @ 2.40 GHz and one NVIDIA GTX 1660 Ti GPU.
The joint scheduling model was trained on one week of simulation and converged after 5000 training rounds. As shown in fig. 5, in the initial stage of training the control strategy changes drastically, and the carbon emissions and the various costs of the system fluctuate strongly. After about 2000 training rounds the model gradually stabilizes and begins to converge; the carbon emissions of the system drop to about 20 t and the total operating cost (converted into carbon-emission-equivalent terms) is about 21 t, which, compared with the ideal carbon emissions and operating cost of 15.3 t and 17.2 t in fig. 5, still differs considerably. The reason for this phenomenon is that the TD3 algorithm has to introduce noise for policy exploration during training; the noise used in the present invention is N(0, 0.09), which causes considerable interference to the policy output by the agent even after its performance has stabilized. This also explains why, at around 1000 rounds, the model briefly attains a low total cost and low carbon emissions that are not matched even later in training. Nevertheless, as can still be seen from fig. 6, as the training rounds increase the model consistently executes an excellent scheduling control strategy, demonstrating the convergence and robustness of the model's performance.
The data used in fig. 6 come from the test rounds during training (i.e. scheduling control tests performed without added noise and without learning, carried out every 10 training rounds). Meanwhile, in order to verify the performance of the TD3 algorithm, a simulation experiment was carried out under the same random-seed conditions using the DDPG algorithm as the comparison algorithm, and the resulting data are also plotted in fig. 6. The scheduling control strategy generated by the TD3 algorithm is stable: the carbon emission and total cost curves of the system decrease steadily with little overall fluctuation. The stability of the control strategy obtained by the DDPG algorithm is mediocre; the carbon emissions and total cost of the system fluctuate over a wide range during training, and the DDPG algorithm also has difficulty converging to the optimal scheduling strategy. Furthermore, as the training rounds increase, the model based on the TD3 algorithm reduces the final carbon emissions of the system to about 14.8 t and the total cost to about 16.2 t, an effective reduction in carbon emissions (or total cost) compared with the conservative control strategy of fig. 5. In addition, in fig. 5 the frequency regulation cost and the total running cost of the system remain roughly constant, indicating that the benefit obtainable through the reactive compensation equipment alone is very limited. This shows that the roughly 6% relative cost saving in fig. 6 is achieved by the trained agent reasonably scheduling and planning the electric vehicles and energy storage devices according to the operating condition of the system, and also shows that electric vehicle participation in scheduling can effectively improve the operating performance of the system.
FIG. 7 shows the penalty values incurred by improper behaviour of the agent during training. It can be clearly seen from the figure that, through continuous learning and interaction, the number of erroneous decisions issued by the agent decreases steadily and finally approaches zero, indicating that the agent can effectively identify the SoC of the vehicles and of the energy storage battery and, on that basis, avoid decisions harmful to battery life and to the operating state of the system.
FIG. 8 shows the number of voltage limit violations that occur during training. When large numbers of electric vehicles are connected during peak periods, they impose a considerable impact on the power system and may cause system voltage violations. Since the invention aims to reduce the carbon emissions of the system while ensuring economy, the agent distributes the electric vehicles that arrive in bulk at peak times across the individual time periods as far as possible for orderly charging and discharging, which optimizes system operation and effectively avoids the voltage violations caused by the concentrated connection of a large amount of load. As shown in the figure, under the conservative control strategy and during roughly the first 2500 training rounds, the system always experiences two voltage violations in one week of scheduling because of the concentrated connection of electric vehicles and the imperfect scheduling strategy; through continuous learning, however, the frequency of voltage violations decreases steadily, which demonstrates the effectiveness of the method and model in optimizing system operation.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (2)

1. An electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
s1: acquiring state information s corresponding to the current time t every time a decision period passes t ={g t ,e t ,b t ,v t ,t},g t Representing the basic information of the power system at the current moment t; e, e t Information representing an electric vehicle; b t Information representative of the energy storage device; v t Information representative of reactive compensation equipment;
furthermore, the state information s t ={g t ,e t ,b t ,v t In t }, e t ={e is ,e soc ,e tar ,e r ,e cap ,e n ,e ch ,e dis H }, wherein e is Representing whether the vehicle is connected; e, e soc Representing a current SoC level of the vehicle; e, e tar Representing a target amount of power that the vehicle should reach at least at the end of charging; e, e r Representing the remaining charge time; e, e cap Representing a vehicle battery capacity; e, e n Representing the position of the charging pile; e, e ch Representing vehicle charging power; e, e dis Representing the charge-vehicle discharge power; η represents the vehicle charge/discharge efficiency;
the state information s t ={g t ,e t ,b t ,v t In t, b t ={b soc ,b up ,b low ,b p ,b n ,h b And b is }, where soc Representing a current SoC level of the energy storage device; b up And b low Respectively representing the upper limit and the lower limit of the SoC allowed by the energy storage equipment; b p Representing the current operating power of the energy storage device; b n Representing the number of the energy storage device; η (eta) b Representing the charge and discharge efficiency of the energy storage device;
the corresponding economic costs allocated by the energy storage equipment in the operation process are respectively as follows:
m b =δ b |△P b
w b =α b |P b |
wherein m is b And w b The power adjustment cost and the operation cost of the energy storage equipment are respectively; delta b And alpha b The unit power adjustment cost and the unit power operation cost of the energy storage equipment are respectively;
the state information s t ={g t ,e t ,b t ,v t In t, v t ={v p ,v n }, where v p Representing the current operating power of the reactive compensation equipment v n Representing the reactive compensation equipment number;
the corresponding economic costs allocated by the energy storage equipment in the operation process are respectively as follows:
m v =δ v |△Q v |
w v =α v |Q v |
wherein m is v And w v The power adjustment cost and the operation cost of the reactive compensation equipment are respectively; delta v And alpha v The unit power adjustment cost and the unit power operation cost of the reactive power compensation equipment are respectively;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
moreover, with the aim of minimizing the carbon emission, the objective function thereof is configured to:
Figure FDA0004228619630000021
wherein m is n The action cost of each device; w (w) n For the operation of the devicesCost; h is a n The system is a penalty function and is used for punishing the misoperation of the intelligent agent; c t Is the carbon emission quantity, P, of the thermal power unit in the running process under a certain load G The power of the unit is; f (F) G The coal consumption rate of the unit; ρ is a carbon conversion coefficient; b G Is a marginal discharge factor of electric quantity;
when training an agent based on the TD3 algorithm, the objective function is modified into the following steps based on the characteristics of the TD3 algorithm:
Figure FDA0004228619630000022
wherein, beta represents action strategies adopted by the intelligent agent in the process of exploring the environment in the training process; ρ β Representing the action distribution under policy μ; q (Q) μ (s, μ (s)) represents the action value that can be generated in the state s when an action is selected according to the policy μ; j (J) β (μ) represents the expected return that can be obtained when an action is selected according to policy μ;
in the process of training the agent based on the TD3 algorithm, the algorithm alternately carries out strategy evaluation and strategy improvement; wherein, the liquid crystal display device comprises a liquid crystal display device,
in the policy evaluation phase, the state-action value, i.e. Q, needs to be calculated μ (s, μ (s)), the Q function can be expressed by the Bellman equation:
Q μ (s t ,a t )=E[r t +γQ μ (s t+1 ,a t+1 )]
after parameterizing the Q function with a neural network, the Q function is approximated by minimizing the Bellman residual:
Figure FDA0004228619630000031
wherein θ Q 、θ Q Parameters representing the Q network and the Target Q network, respectively;
in the strategy improvement stage, after the Q function is parameterized by the neural network, the objective function J is minimized β (mu) for updatingGradient of network parameters, namely:
Figure FDA0004228619630000032
and, further, performing Target value evaluation through two Target Q networks with different initial parameters, and selecting a smaller value as the Target value, so that the minimum Bellman residual error correction is as follows:
Figure FDA0004228619630000033
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure FDA0004228619630000034
is a noisy motion; />
Figure FDA0004228619630000035
And->
Figure FDA0004228619630000036
Parameters for two different Target Q networks;
and smoothing regularization using the target strategy to enhance stability of the strategy and smooth the Q-function; i.e. in the next state s when calculating the Bellman residual t+1 Action a taken t+1 Will be chosen as:
Figure FDA0004228619630000037
wherein μ represents the Target policy network; epsilon is added noise, typically selected to be gaussian noise, and its amplitude is sheared to limit to a small extent;
further, the agent output parameter a is configured μ And a std And takes this as the input of the Q network, the actual action a' te The selection is as follows:
Figure FDA0004228619630000041
z=tanh(a μ +a std ·ε)
Figure FDA0004228619630000042
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
2. The deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles according to claim 1, wherein the penalty function h_n comprises: the penalty function h_vec corresponding to the electric vehicles and the penalty function h_b corresponding to the energy storage device; wherein,

h_vec = ||(z − a_em)||²

h_b = ||(a_tb − a_bm)||²

where a_em is an auxiliary vector for the vehicle actions, equal to the vector obtained by clipping z according to the boundary conditions; and a_bm is an auxiliary vector for the energy storage battery action, equal to the vector obtained by clipping a_tb according to the capacity boundary.
CN202211225223.9A 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning Active CN115663793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211225223.9A CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211225223.9A CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115663793A CN115663793A (en) 2023-01-31
CN115663793B true CN115663793B (en) 2023-06-23

Family

ID=84985454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211225223.9A Active CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115663793B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860287B (en) * 2023-03-02 2023-05-12 东方电气集团科学技术研究院有限公司 Low-carbon economical dispatching method for energy storage and generator set

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106159360A (en) * 2016-06-28 2016-11-23 合肥工业大学 A kind of charging electric vehicle method based on mobile charger pattern
CN111934335A (en) * 2020-08-18 2020-11-13 华北电力大学 Cluster electric vehicle charging behavior optimization method based on deep reinforcement learning
CN113515884A (en) * 2021-04-19 2021-10-19 国网上海市电力公司 Distributed electric vehicle real-time optimization scheduling method, system, terminal and medium
CN114971251A (en) * 2022-05-17 2022-08-30 南京逸刻畅行科技有限公司 Shared electric vehicle dispatching method based on deep reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7761389B2 (en) * 2007-08-23 2010-07-20 Gm Global Technology Operations, Inc. Method for anomaly prediction of battery parasitic load
US11731526B2 (en) * 2020-06-22 2023-08-22 Volta Charging, Llc Systems and methods for identifying characteristics of electric vehicles

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106159360A (en) * 2016-06-28 2016-11-23 合肥工业大学 A kind of charging electric vehicle method based on mobile charger pattern
CN111934335A (en) * 2020-08-18 2020-11-13 华北电力大学 Cluster electric vehicle charging behavior optimization method based on deep reinforcement learning
CN113515884A (en) * 2021-04-19 2021-10-19 国网上海市电力公司 Distributed electric vehicle real-time optimization scheduling method, system, terminal and medium
CN114971251A (en) * 2022-05-17 2022-08-30 南京逸刻畅行科技有限公司 Shared electric vehicle dispatching method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on intelligent car-following control and energy management strategy of hybrid electric vehicles based on deep reinforcement learning; Tang Xiaolin et al.; Journal of Mechanical Engineering; Vol. 57, No. 22; 237-246 *

Also Published As

Publication number Publication date
CN115663793A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
Wu et al. Deep learning adaptive dynamic programming for real time energy management and control strategy of micro-grid
CN112186743B (en) Dynamic power system economic dispatching method based on deep reinforcement learning
CN110365056B (en) Distributed energy participation power distribution network voltage regulation optimization method based on DDPG
US11326579B2 (en) Adaptive dynamic planning control method and system for energy storage station, and storage medium
CN108429288B (en) Off-grid type microgrid energy storage optimization configuration method considering demand response
CN109765787B (en) Power distribution network source load rapid tracking method based on intraday-real-time rolling control
Chen et al. Intelligent energy scheduling in renewable integrated microgrid with bidirectional electricity-to-hydrogen conversion
CN110956324B (en) Day-ahead high-dimensional target optimization scheduling method for active power distribution network based on improved MOEA/D
CN115663793B (en) Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning
CN113408962A (en) Power grid multi-time scale and multi-target energy optimal scheduling method
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN112952847A (en) Multi-region active power distribution system peak regulation optimization method considering electricity demand elasticity
CN115313380A (en) New energy hydrogen production system coordination control method adaptive to hydrogen load fluctuation
CN115115130A (en) Wind-solar energy storage hydrogen production system day-ahead scheduling method based on simulated annealing algorithm
CN106786702A (en) Full range modeling for mixed energy storage system predicts energy dispatching method
KR20230070779A (en) Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning
CN117060470B (en) Power distribution network voltage optimization control method based on flexible resources
CN105119285A (en) Wind power storage coordination multi-objective optimization control method based on dynamic weighting
CN116436019B (en) Multi-resource coordination optimization method, device and storage medium
Chen et al. Deep reinforcement learning based research on low‐carbon scheduling with distribution network schedulable resources
Zhao et al. A data‐driven scheduling approach for integrated electricity‐hydrogen system based on improved DDPG
CN115841216A (en) Distribution network energy storage optimization configuration method considering distributed photovoltaic absorption rate
CN115940284A (en) Operation control strategy of new energy hydrogen production system considering time-of-use electricity price
Liu et al. Renewable energy utilizing and fluctuation stabilizing using optimal dynamic grid connection factor strategy and artificial intelligence-based solution method
Tongyu et al. Based on deep reinforcement learning algorithm, energy storage optimization and loss reduction strategy for distribution network with high proportion of distributed generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Chen Shi

Inventor after: Liu Yihong

Inventor after: Guo Zhengwei

Inventor after: Zhu Yabin

Inventor after: Yang Linsen

Inventor before: Chen Shi

Inventor before: Guo Zhengwei

Inventor before: Zhu Yabin

Inventor before: Yang Linsen

Inventor before: Liu Yihong

CB03 Change of inventor or designer information