CN115663793B - Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning - Google Patents

Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Info

Publication number
CN115663793B
CN115663793B
Authority
CN
China
Prior art keywords
representing
energy storage
power
action
vehicle
Prior art date
Legal status
Active
Application number
CN202211225223.9A
Other languages
Chinese (zh)
Other versions
CN115663793A (en)
Inventor
陈实
郭正伟
朱亚斌
杨林森
刘艺洪
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202211225223.9A
Publication of CN115663793A
Application granted
Publication of CN115663793B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/60: Other road transportation technologies with climate change mitigation effect
    • Y02T 10/70: Energy storage systems for electromobility, e.g. batteries

Landscapes

  • Charge And Discharge Circuits For Batteries Or The Like (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles. Taking carbon emission minimization as the objective, the method coordinates the electric vehicles, the system energy storage devices and the compensation devices and, under constraints that prevent system overload caused by the random access of electric vehicles, solves the model with the deep reinforcement learning algorithm TD3 to obtain an optimized charge-discharge control strategy for the electric vehicles and the system energy storage facilities. By controlling the charging and discharging of the electric vehicles while still meeting their charging demand, system operation can be optimized, the renewable energy in the system can be effectively absorbed, and the carbon emissions of system operation can be reduced.

Description

Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of power dispatching, and in particular to an electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning.
Background
According to relevant statistical analyses, power systems account for about 15% of global carbon emissions through the combustion of fossil fuels, so reducing the carbon emissions of the power system provides a strong impetus for green, sustainable development. The fundamental means of achieving carbon emission reduction in the power system is energy substitution, that is, replacing fossil energy such as coal with large-scale renewable energy for electricity generation so as to effectively reduce carbon emissions.
However, renewable energy represented by wind and solar power is random, which degrades the operating performance of the system, so it needs to be paired with an energy storage system. Deploying energy storage can raise the grid-connection rate of new energy while optimizing the power flow distribution of the system and supporting frequency and peak regulation, thereby greatly improving system operating performance. Meanwhile, as the penetration of electric vehicles rises rapidly, their influence on grid operation can no longer be ignored. Electric vehicles also have characteristics such as relatively large per-unit capacity, fast response speed, and a certain regularity in the times at which they connect to the system, giving them great potential for flexible scheduling. They can therefore serve as schedulable energy storage for the system and be used to smooth the negative influence that the randomness and fluctuation of renewable energy such as wind and photovoltaic power exert on system operation, thereby optimizing system operation, improving the utilization of wind and photovoltaic power, and further reducing the carbon emissions of the system.
However, how to optimize system operation by scheduling the charging and discharging of electric vehicles, while still meeting their charging demand, so as to effectively absorb the renewable energy in the system and reduce the carbon emissions of system operation, remains a problem to be solved.
Disclosure of Invention
To address the above problem in the prior art, the invention provides a low-carbon charge-discharge scheduling method for electric vehicles based on deep reinforcement learning. Taking carbon emission minimization as the objective, the method coordinates the electric vehicles, the system energy storage devices and the compensation devices and, under constraints that prevent system overload caused by the random access of electric vehicles (EVs), solves the model with the deep reinforcement learning algorithm TD3 to obtain an optimized charge-discharge control strategy for the electric vehicles and the system energy storage facilities. By controlling the charging and discharging of the electric vehicles while still meeting their charging demand, system operation can be optimized, the renewable energy in the system can be effectively absorbed, and the carbon emissions of system operation can be reduced.
The invention provides a deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles, which comprises the following steps:
s1: every time a decision period passes, a state corresponding to the current time t is obtainedState information s t ={g t ,e t ,b t ,v t ,t}, g t Representing the basic information of the power system at the current moment t; e, e t Information representing an electric vehicle; b t Information representative of the energy storage device; v t Information representative of reactive compensation equipment;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
According to one possible embodiment, with carbon emission minimization as the objective, the objective function is configured as:

min Σ_t [ c_t + Σ_n ( m_n + w_n + h_n ) ]

c_t = P_G F_G ρ = P_G b_G

where m_n is the action (adjustment) cost of each device; w_n is the running cost of each device; h_n is a penalty function used to punish improper actions of the agent; c_t is the carbon emission of the thermal power unit operating at a given load; P_G is the unit power; F_G is the coal consumption rate of the unit; ρ is the carbon conversion coefficient; and b_G is the marginal emission factor of electricity.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, e_t = {e_is, e_soc, e_tar, e_r, e_cap, e_n, e_ch, e_dis, η}, where e_is indicates whether a vehicle is connected; e_soc represents the current SoC level of the vehicle; e_tar represents the target amount of energy the vehicle should at least reach by the end of charging; e_r represents the remaining charging time; e_cap represents the vehicle battery capacity; e_n represents the position of the charging pile; e_ch represents the vehicle charging power; e_dis represents the vehicle discharging power; and η represents the vehicle charge/discharge efficiency.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, b_t = {b_soc, b_up, b_low, b_p, b_n, η_b}, where b_soc represents the current SoC level of the energy storage device; b_up and b_low represent the upper and lower SoC limits allowed for the energy storage device, respectively; b_p represents the current operating power of the energy storage device; b_n represents the number of the energy storage device; and η_b represents the charge/discharge efficiency of the energy storage device;

the corresponding economic costs allocated to the energy storage device during operation are, respectively:

m_b = δ_b |ΔP_b|

w_b = α_b |P_b|

where m_b and w_b are the power adjustment cost and the operation cost of the energy storage device, respectively; δ_b and α_b are the unit-power adjustment cost and the unit-power operation cost of the energy storage device, respectively.
According to one possible embodiment, in the state information s_t = {g_t, e_t, b_t, v_t, t}, v_t = {v_p, v_n}, where v_p represents the current operating power of the reactive compensation device and v_n represents the number of the reactive compensation device;

the corresponding economic costs allocated to the reactive compensation device for depreciation during operation are, respectively:

m_v = δ_v |ΔQ_v|

w_v = α_v |Q_v|

where m_v and w_v are the power adjustment cost and the operation cost of the reactive compensation device, respectively; δ_v and α_v are the unit-power adjustment cost and the unit-power operation cost of the reactive compensation device, respectively.
According to one possible implementation, when training the agent based on the TD3 algorithm, the objective function is further modified, based on the characteristics of the TD3 algorithm, to:

J_β(μ) = E_{s~ρ^β}[ Q^μ(s, μ(s)) ]

where β denotes the behaviour policy adopted by the agent when exploring the environment during training; ρ^β denotes the distribution of states visited under the behaviour policy; Q^μ(s, μ(s)) denotes the action value that can be generated in state s when actions are selected according to policy μ; and J_β(μ) denotes the expected return that can be obtained when actions are selected according to policy μ.
According to one possible implementation, when training the agent based on the TD3 algorithm, the algorithm alternates between policy evaluation and policy improvement; wherein,

in the policy evaluation phase, the state-action value Q^μ(s, μ(s)) needs to be calculated, and the Q function can be expressed by the Bellman equation:

Q^μ(s_t, a_t) = E[ r_t + γ Q^μ(s_{t+1}, a_{t+1}) ]

After the Q function is parameterized with a neural network, it is approximated by minimizing the Bellman residual:

L(θ^Q) = E[ ( Q(s_t, a_t | θ^Q) − ( r_t + γ Q'(s_{t+1}, a_{t+1} | θ^{Q'}) ) )² ]

where θ^Q and θ^{Q'} denote the parameters of the Q network and the Target Q network, respectively;

in the policy improvement phase, after the Q function has been parameterized by a neural network, the gradient of the objective function J_β(μ) is used to update the network parameters, i.e.:

∇_{θ^μ} J_β(μ) ≈ E[ ∇_a Q(s, a | θ^Q) |_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ) ]
according to one possible implementation manner, in the training process of the intelligent agent based on the TD3 algorithm, the Target value is further evaluated through two Target Q networks with different initial parameters, and the smaller value is selected as the Target value, so that the minimum Bellman residual error correction is required to be:
Figure BDA0003879394550000044
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003879394550000045
is a noisy motion; />
Figure BDA0003879394550000046
And->
Figure BDA0003879394550000047
Parameters for two different Target Q networks;
and smoothing regularization using the target strategy to enhance stability of the strategy and smooth the Q-function; i.e. in the next state s when calculating the Bellman residual t+1 Action a taken t+1 Will be chosen as:
Figure BDA0003879394550000051
wherein μ represents the Target policy network; epsilon is the added noise, typically selected to be gaussian noise, and its magnitude is sheared to a small extent.
According to one possible implementation, the agent is configured to output the parameters a_μ and a_std, which are taken as the input of the Q network, and the actual action a'_te is selected as:

ε ~ N(0, 1)

z = tanh( a_μ + a_std · ε )

a'_te is the discrete charge/discharge action determined by the interval into which z falls.
according to one possible implementation, the penalty function h n Comprising the following steps: penalty function h corresponding to electric automobile vec Penalty function h corresponding to energy storage device b The method comprises the steps of carrying out a first treatment on the surface of the Wherein, the liquid crystal display device comprises a liquid crystal display device,
h vec =||(z-a em )|| 2
h b =||(a tb -a bm )|| 2
wherein a is em An auxiliary vector for vehicle motion is equal to a vector obtained by performing motion shearing on z according to boundary conditions; a, a bm An auxiliary vector for the action of the energy storage battery is equal to a tb And performing action shearing according to the capacity boundary of the vector.
Description of the drawings:
FIG. 1 is a schematic illustration of the process of the present invention;
FIG. 2 is a schematic diagram of the relationship between the standard coal consumption rate of the thermal power generating unit and the unit load;
FIG. 3 is a probability diagram corresponding to the compressed Gaussian strategy adopted for the electric vehicles;
FIG. 4 is a schematic diagram of an architecture of a power distribution system in a simulation experiment;
FIG. 5 is a schematic diagram of simulation results with respect to carbon emissions and various costs;
FIG. 6 is a graph showing the comparison of the effects of training an agent by the TD3 algorithm and the DDPG algorithm;
FIG. 7 is a statistical plot of the number of penalties to an agent during training;
FIG. 8 is a statistical chart of the number of voltage limit violations occurring in the distribution network system during training.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and specific examples. The scope of the subject matter of the invention should not be construed as being limited to the following embodiments; all techniques realized on the basis of the content of the invention fall within the scope of the invention.
In one embodiment of the invention, as shown in fig. 1, the electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning of the invention comprises the following steps:
s1: acquiring state information s corresponding to the current time t every time a decision period passes t ={g t ,e t ,b t ,v t ,t}, g t Representing the basic information of the power system at the current moment t; e, e t Information representing the current time t of the electric automobile; b t Information representing the energy storage device at the current time t; v t Information representing the reactive compensation equipment at the current moment t;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
In this embodiment, a Markov decision process (MDP) describes a memoryless sequence of state transitions in which the next state depends only on the current state and the action currently being executed. For the joint scheduling problem studied by the invention, if the power supply, the load, and other quantities that are not controlled by the scheduling strategy are treated as approximately constant within each decision period, the joint scheduling model becomes a sequential model and can be further converted into an MDP model, so that the joint scheduling problem can be solved with a reinforcement learning algorithm.
Specifically, a typical Markov decision process can be represented by a five-tuple {S, A, R, P, γ}, where S denotes the state space, A the action space, R the return space, P the set of state transition probabilities, and γ the discount rate of the return. For the joint scheduling problem of the invention, which takes carbon emission minimization as the objective, the five-tuple of the corresponding MDP can be determined as follows:
(1) S: the State is the environmental information that the agent can perceive, and it is used as the input from which the agent generates a policy or action. The State at time t is defined as:

s_t = {g_t, e_t, b_t, v_t, t}

where g_t represents the basic information of the power system at the current time t; e_t represents the information of the electric vehicles at the current time t; b_t represents the information of the energy storage devices at the current time t; and v_t represents the information of the reactive compensation devices at the current time t.
(2) A: an Action is the policy (or action) that the agent outputs after perceiving the state s_t; it is fed into the environment to drive the state transition to the next step. The action output by the agent in state s_t is defined as:

a_t = {a_te, a_tb, a_tv}

where a_te represents the charge/discharge actions of each vehicle; a_tb represents the charge/discharge power of the energy storage devices; and a_tv represents the operating power of the reactive compensation devices.
(3) R: when an agent performs a policy (or action), a certain return can be obtained from the environment, and the return value will be used to evaluate the quality of the policy (or action) taken. In the present invention, reward is defined as:
r_t = p_t + c_t

where p_t represents the cost incurred from the outcomes caused by the scheduling strategy, generated by a series of auxiliary policies; and c_t represents the amount of carbon emissions generated during the scheduling process.
(4) P: Probability denotes the state transition probability. When the agent generates and executes a policy (or action), the environment controls the objects within it to interact according to that policy and transitions to the next state. In this process, the uncertainty of the environment itself causes the interaction process and the next state to vary, so a state transition probability matrix P is generally used to represent the probability of transitioning to each state after the policy is executed. In the model studied by the invention, the electric vehicles that connect to the system at the next moment and the output of the wind-photovoltaic generation system are uncertain, so even under the same strategy the state at the next moment is uncertain from the agent's point of view. The probability matrix P is generated implicitly by Monte Carlo sampling of the environment.
(5) γ: the discount factor γ is used to describe the importance of the returns that can be obtained in the future. Going through a complete Markov decision process produces a Markov chain, or equivalently a "trajectory", which yields the return:

R = Σ_{t=0}^{T} r_t

It should be pointed out that, on the one hand, future returns are difficult to estimate accurately because of the state transition probabilities and are therefore slightly less important than the current return; on the other hand, when rewards are sparse, the value of a state is closely related to the final result, so the later returns need to be multiplied by a discount factor, giving:

R = Σ_{t=0}^{T} γ^t r_t

where γ takes values in [0, 1]. The smaller γ is, the more the agent focuses on the return obtainable from the current strategy (or action), which appears short-sighted, but the training process converges relatively faster; the larger γ is, the more the agent emphasizes the returns the strategy (or action) can obtain in the future, which appears far-sighted, but the training process converges slowly and may suffer from saturation problems.

Generally, γ can take the value 0.9; alternatively, letting the average length of a trajectory be n, the recommended value of γ is:

[equation image: recommended value of γ as a function of the average trajectory length n]
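As an illustration of the five-tuple and the discounted return defined above, the short Python sketch below flattens the state and action components into vectors and accumulates R = Σ_t γ^t r_t over one trajectory; the array layout and dimensions are illustrative assumptions rather than the patent's exact data format.

```python
import numpy as np

def build_state(g_t, e_t, b_t, v_t, t):
    """Flatten the grid, EV, storage and SVC information into one state vector s_t."""
    return np.concatenate([np.atleast_1d(g_t), np.atleast_1d(e_t),
                           np.atleast_1d(b_t), np.atleast_1d(v_t), [t]])

def build_action(a_te, a_tb, a_tv):
    """Concatenate EV charge/discharge actions, storage power and SVC power into a_t."""
    return np.concatenate([np.atleast_1d(a_te), np.atleast_1d(a_tb), np.atleast_1d(a_tv)])

def discounted_return(rewards, gamma=0.9):
    """R = sum_t gamma^t * r_t accumulated over one trajectory (Markov chain)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A 24-step trajectory with unit rewards; with gamma = 0.9 the discounted
# return is (1 - 0.9**24) / (1 - 0.9) ≈ 9.20.
print(discounted_return([1.0] * 24))
```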
in the present embodiment, with the aim of minimizing the carbon emission amount, the objective function thereof is configured to:
Figure BDA0003879394550000091
wherein m is n The action cost of each device; w (w) n The running cost of each device; h is a n The system is a penalty function and is used for punishing the misoperation of the intelligent agent; c t Is the carbon emission quantity, P, of the thermal power unit in the running process under a certain load G The power of the unit is; f (F) G The coal consumption rate of the unit; ρ is a carbon conversion coefficient; b G Is the marginal discharge factor of the electric quantity.
Since the thermal power unit consumes fuel and produces a certain amount of carbon emissions during operation, and this is influenced by its operating power (in general, the unit consumes more fuel per unit of electricity when operating at low power), the relation between the coal consumption rate and the unit load of the thermal power unit shown in fig. 2 can be used to convert the carbon emission of the unit at a given load through the curve:

c_t = P_G F_G ρ = P_G b_G

where P_G is the unit power; F_G is the coal consumption rate of the unit; ρ is the carbon conversion coefficient; and b_G is the marginal emission factor of electricity.
The operating cost and the frequency regulation (adjustment) cost of the thermal power unit are:

m_G = δ_G ΔP_G

w_G = P_G F_G M_f

where m_G and w_G are the power adjustment cost and the operation cost of the thermal power unit, respectively; δ_G is the unit-power adjustment cost; and M_f is the fuel cost.
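For illustration, the sketch below evaluates c_t = P_G · F_G · ρ and the thermal unit's adjustment and fuel costs; the coal-consumption curve standing in for FIG. 2, the carbon conversion coefficient and the cost coefficients are placeholder assumptions, not values from the patent.

```python
import numpy as np

# Hypothetical (load fraction, coal consumption rate) pairs standing in for the
# curve of FIG. 2; real values would come from the unit's measured characteristics.
LOAD_POINTS = np.array([0.3, 0.5, 0.7, 0.9, 1.0])
COAL_RATE = np.array([0.36, 0.33, 0.31, 0.30, 0.30])   # F_G in t of standard coal per MWh

def coal_rate(p_g, p_rated):
    """Interpolate the coal consumption rate F_G at the current unit power P_G."""
    return float(np.interp(p_g / p_rated, LOAD_POINTS, COAL_RATE))

def carbon_emission(p_g, p_rated, rho=2.66):
    """c_t = P_G * F_G * rho; rho is a placeholder carbon conversion coefficient."""
    return p_g * coal_rate(p_g, p_rated) * rho

def thermal_costs(p_g, delta_p_g, p_rated, delta_g=5.0, m_f=100.0):
    """m_G = delta_G * dP_G (here taken in magnitude) and w_G = P_G * F_G * M_f."""
    return delta_g * abs(delta_p_g), p_g * coal_rate(p_g, p_rated) * m_f

print(carbon_emission(p_g=2.0, p_rated=4.0))              # tCO2 in this hour
print(thermal_costs(p_g=2.0, delta_p_g=0.5, p_rated=4.0))
```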
Specifically, in the state information s_t = {g_t, e_t, b_t, v_t, t}, e_t = {e_is, e_soc, e_tar, e_r, e_cap, e_n, e_ch, e_dis, η}, where e_is indicates whether a vehicle is connected; e_soc represents the current SoC level of the vehicle; e_tar represents the target amount of energy the vehicle should at least reach by the end of charging; e_r represents the remaining charging time; e_cap represents the vehicle battery capacity; e_n represents the position of the charging pile; e_ch represents the vehicle charging power; e_dis represents the vehicle discharging power; and η represents the vehicle charge/discharge efficiency. Specifically, the charging demand data are processed in combination with the work-rest pattern of present-day urban life: the electric vehicle charging demand is described by a bimodal normal distribution, and samples are generated by Monte Carlo sampling.
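A minimal sketch of the Monte Carlo sampling just mentioned, assuming an illustrative two-peak normal mixture for the plug-in times; the peak hours, standard deviation and mixture weight are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ev_arrivals(n_vehicles, peak1=8.0, peak2=19.0, std=1.5, w1=0.45):
    """Draw EV plug-in times (hours) from a two-peak normal mixture, wrapped to [0, 24)."""
    use_first_peak = rng.random(n_vehicles) < w1
    means = np.where(use_first_peak, peak1, peak2)
    return np.mod(rng.normal(loc=means, scale=std), 24.0)

# One Monte Carlo sample of 50 plug-in hours:
print(np.sort(sample_ev_arrivals(50)).round(1))
```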
The charging and discharging processes of the electric vehicle are described by the following equations:

[equation image: SoC update equations for vehicle charging and discharging, with the battery energy kept between e_lower and e_upper]

where e_upper and e_lower respectively denote the upper and lower limits of the battery energy that ensure a normal battery service life.
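Because the equation image is not reproduced in this text, the sketch below gives one common form of such an update that is consistent with the symbols defined above (charging scaled by η, discharging divided by η, energy clipped to [e_lower, e_upper]); it is an assumed stand-in, not the patent's exact formula.

```python
def ev_soc_update(e_soc, e_cap, p, dt=1.0, eta=0.95, e_lower=0.1, e_upper=0.9):
    """One possible SoC transition: p > 0 charging, p < 0 discharging.

    Charging stores eta * p * dt in the battery; discharging draws p * dt / eta
    from it; the stored energy is clipped to [e_lower, e_upper] * e_cap to
    protect the battery's service life.
    """
    if p >= 0:
        energy = e_soc * e_cap + eta * p * dt
    else:
        energy = e_soc * e_cap + p * dt / eta
    energy = min(max(energy, e_lower * e_cap), e_upper * e_cap)
    return energy / e_cap

# Charging a 0.06 MWh battery at 0.007 MW for one hour from 30% SoC:
print(round(ev_soc_update(0.30, 0.06, 0.007), 3))   # -> 0.411
```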
Furthermore, b_t = {b_soc, b_up, b_low, b_p, b_n, η_b}, where b_soc represents the current SoC level of the energy storage device; b_up and b_low represent the upper and lower SoC limits allowed for the energy storage device, respectively; b_p represents the current operating power of the energy storage device; b_n represents the number of the energy storage device; and η_b represents the charge/discharge efficiency of the energy storage device; the charging and discharging process of the energy storage device is described analogously.
the state information s t ={g t ,e t ,b t ,v t In t, v t ={v p ,v n }, where v p Representing the current operating power of the reactive compensation equipment v n Representing the reactive compensation equipment number.
Meanwhile, the economic costs allocated to the energy storage device during operation are, respectively:

m_b = δ_b |ΔP_b|

w_b = α_b |P_b|

where m_b and w_b are the power adjustment cost and the operation cost of the energy storage device, respectively; δ_b and α_b are the unit-power adjustment cost and the unit-power operation cost of the energy storage device, respectively.
The corresponding economic costs allocated to the reactive compensation device for depreciation during operation are, respectively:

m_v = δ_v |ΔQ_v|

w_v = α_v |Q_v|

where m_v and w_v are the power adjustment cost and the operation cost of the reactive compensation device, respectively; δ_v and α_v are the unit-power adjustment cost and the unit-power operation cost of the reactive compensation device, respectively.
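These cost terms are simple absolute-value expressions; the sketch below evaluates them, with the unit-cost coefficients δ and α taken as placeholder values.

```python
def storage_costs(p_b, delta_p_b, delta_b=2.0, alpha_b=1.0):
    """m_b = delta_b * |dP_b| (adjustment) and w_b = alpha_b * |P_b| (operation)."""
    return delta_b * abs(delta_p_b), alpha_b * abs(p_b)

def svc_costs(q_v, delta_q_v, delta_v=1.5, alpha_v=0.8):
    """m_v = delta_v * |dQ_v| (adjustment) and w_v = alpha_v * |Q_v| (operation)."""
    return delta_v * abs(delta_q_v), alpha_v * abs(q_v)

print(storage_costs(p_b=1.2, delta_p_b=-0.4))   # -> (0.8, 1.2)
print(svc_costs(q_v=0.6, delta_q_v=0.1))        # ≈ (0.15, 0.48)
```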
In this embodiment, when training the agent based on the TD3 algorithm, the objective function is further modified, based on the characteristics of the TD3 algorithm, to:

J_β(μ) = E_{s~ρ^β}[ Q^μ(s, μ(s)) ]

where β denotes the behaviour policy adopted by the agent when exploring the environment during training; ρ^β denotes the distribution of states visited under the behaviour policy; Q^μ(s, μ(s)) denotes the action value that can be generated in state s when actions are selected according to policy μ; and J_β(μ) denotes the expected return that can be obtained when actions are selected according to policy μ.
Moreover, in the process of training the agent with the TD3 algorithm, the algorithm alternates between policy evaluation and policy improvement; wherein,

in the policy evaluation phase, the state-action value Q^μ(s, μ(s)) needs to be calculated, and the Q function can be expressed by the Bellman equation:

Q^μ(s_t, a_t) = E[ r_t + γ Q^μ(s_{t+1}, a_{t+1}) ]

After the Q function is parameterized with a neural network, it is approximated by minimizing the Bellman residual:

L(θ^Q) = E[ ( Q(s_t, a_t | θ^Q) − ( r_t + γ Q'(s_{t+1}, a_{t+1} | θ^{Q'}) ) )² ]

where θ^Q and θ^{Q'} denote the parameters of the Q network and the Target Q network, respectively;

in the policy improvement phase, after the Q function has been parameterized by a neural network, the gradient of the objective function J_β(μ) is used to update the network parameters, i.e.:

∇_{θ^μ} J_β(μ) ≈ E[ ∇_a Q(s, a | θ^Q) |_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ) ]
in addition, in order to reduce the oscillation and error caused by updating the strategy when the Q function is not stable, a strategy delay updating technology is adopted, so that the frequency of occurrence of a strategy improvement stage is lower than that of occurrence of a strategy evaluation stage, namely, the strategy evaluation is carried out when an intelligent agent interacts with the environment each time, but the strategy improvement is carried out only once after a certain number of interactions.
In this embodiment, in order to solve the overestimation problem of the Q-value network and to smooth the Q function during training, a double Q-value network and target policy smoothing regularization are used. Specifically, the Target value is evaluated through two Target Q networks with different initial parameters, and the smaller of the two is selected as the Target value, so that the Bellman residual to be minimized is corrected to:

L(θ^{Q_i}) = E[ ( Q_i(s_t, a_t | θ^{Q_i}) − ( r_t + γ min_{j=1,2} Q'_j(s_{t+1}, ã_{t+1} | θ^{Q'_j}) ) )² ],  i = 1, 2

where ã_{t+1} is the noise-perturbed action, and θ^{Q'_1} and θ^{Q'_2} are the parameters of the two different Target Q networks;

target policy smoothing regularization is also used to enhance the stability of the policy and smooth the Q function; that is, when calculating the Bellman residual, the action ã_{t+1} taken in the next state s_{t+1} is chosen as:

ã_{t+1} = μ'(s_{t+1} | θ^{μ'}) + ε,  ε ~ clip( N(0, σ²), −c, c )

where μ' denotes the Target policy network; ε is the added noise, typically chosen as Gaussian noise, whose amplitude is clipped so as to be limited to a small range. By adding noise to the action, the Q value of the action used when calculating the Bellman residual converges towards the expected Q value of the actions in its neighbourhood, thereby smoothing the Q function. The smoothed Q function is then used to guide the update of the policy network, which effectively prevents excessively fast updates of the policy-network parameters caused by overly large gradients of the Q function and enhances the stability of the policy network.
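The following NumPy sketch makes the two mechanisms above concrete: the Target value is formed with clipped Gaussian noise on the target policy's action and with the minimum of the two Target Q networks; the network objects are hypothetical callables, and the delayed policy update is indicated only by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, mu_target, q1_target, q2_target,
               gamma=0.98, sigma=0.2, clip_c=0.5, a_low=-1.0, a_high=1.0):
    """y = r + gamma * min_i Q'_i(s_{t+1}, a~_{t+1}) with target policy smoothing.

    mu_target, q1_target and q2_target are hypothetical callables standing in
    for the Target policy network and the two Target Q networks.
    """
    a_target = np.asarray(mu_target(s_next))
    noise = np.clip(rng.normal(0.0, sigma, size=a_target.shape), -clip_c, clip_c)
    a_tilde = np.clip(a_target + noise, a_low, a_high)          # noisy target action
    return r + gamma * np.minimum(q1_target(s_next, a_tilde),   # clipped double-Q minimum
                                  q2_target(s_next, a_tilde))

# Both critics are regressed towards y at every interaction step, while the
# policy (Actor) and the Target networks are updated only once every several
# critic updates (delayed policy update); that outer loop is omitted here.

# Toy check with linear stand-ins for the Target networks:
mu_t = lambda s: 0.5 * s
q1_t = lambda s, a: s + a
q2_t = lambda s, a: s + 0.9 * a
print(td3_target(r=1.0, s_next=np.array([0.2]),
                 mu_target=mu_t, q1_target=q1_t, q2_target=q2_t))
```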
In the present embodiment, the action vector a_te output by the agent represents the charge/discharge action of each vehicle and should therefore be a discrete, binary-style output. However, the policy network of the TD3 algorithm performs gradient-based optimization on the Q function, which requires the Q function to be differentiable. A discrete output would introduce steps into the true Q function (since the Q network is only an approximation of the objectively existing true Q function, the result is not genuine non-differentiability but rather gradient explosion, which would make the network unstable). Moreover, the action space of TD3 is a continuous action space rather than a discrete one. Therefore, the agent is configured to output the parameters a_μ and a_std, which are taken as the input of the Q network, and the actual action a'_te is selected as:

ε ~ N(0, 1)

z = tanh( a_μ + a_std · ε )

a'_te is the discrete charge/discharge action determined by the interval into which z falls.

Through the above operations, the output of the policy network is mapped to a Gaussian distribution, and the selection probability of an actual action equals the integral of the probability density function over one of three intervals after the Gaussian distribution has been compression-transformed. As shown in fig. 3, the areas of the grey striped shading and the blue striped shading represent the probabilities of the vehicle performing the discharging and charging actions, respectively, and the shaded areas are controlled by the parameters a_μ and a_std. The Q function is therefore differentiable with respect to a_μ and a_std, and the corresponding Q value represents the expected benefit obtained when, given the parameters a_μ and a_std, the vehicle charges or discharges according to the compressed Gaussian strategy. Moreover, using this compressed Gaussian strategy makes the TD3 algorithm compatible with discrete action outputs, at the cost of requiring more samples. In addition, the charge and discharge power of the electric vehicle is set here in discrete gears; considering practical situations, if a vehicle supports continuous power adjustment, its charge/discharge power can be taken directly from the compressed Gaussian strategy.
In this embodiment, penalty functions are set for different objects. When the strategy output by the agent is inconsistent with the action space allowed by the vehicle network, the improper behaviour must be punished; for the energy storage device, since it is always connected to the distribution network, only operations that would drive its battery energy beyond its limits need to be punished. The penalty function h_n is therefore set to comprise: the penalty function h_vec corresponding to the electric vehicles and the penalty function h_b corresponding to the energy storage device; wherein,

h_vec = ||(z − a_em)||²

h_b = ||(a_tb − a_bm)||²

where a_em is an auxiliary vector for the vehicle actions, equal to the vector obtained by clipping z according to the boundary conditions; and a_bm is an auxiliary vector for the energy storage battery action, equal to the vector obtained by clipping a_tb according to the capacity boundary.
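A minimal sketch of the two penalty terms, implementing the clipping described above; the feasible ranges are placeholder assumptions.

```python
import numpy as np

def vehicle_penalty(z, lower, upper):
    """h_vec = ||z - a_em||^2, where a_em is z clipped to each vehicle's feasible range."""
    a_em = np.clip(z, lower, upper)
    return float(np.sum((z - a_em) ** 2))

def storage_penalty(a_tb, p_min, p_max):
    """h_b = ||a_tb - a_bm||^2, where a_bm is a_tb clipped to the storage capacity limits."""
    a_bm = np.clip(a_tb, p_min, p_max)
    return float(np.sum((a_tb - a_bm) ** 2))

# A vehicle that is not plugged in (feasible range collapsed to 0) is penalised
# for any non-zero command, while a feasible command incurs no penalty:
print(vehicle_penalty(np.array([0.8, -0.2]),
                      lower=np.array([0.0, -1.0]),
                      upper=np.array([0.0, 1.0])))   # ≈ 0.64
```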
In order to further verify the effectiveness of the deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles, simulation experiments were carried out. Specifically, an IEEE 33-node power distribution system was selected as the prototype for the simulation and partially adjusted. As shown in fig. 4, a wind turbine generator set and a photovoltaic generation system are installed at nodes 6 and 24, respectively; energy storage devices are installed at nodes 10 and 16; static var compensators are installed at nodes 18 and 23; and electric vehicle charging stations are installed at nodes 11 and 30.
The output data of the wind turbine generator set and the photovoltaic generation system are derived from the elia.be forecasts for Aggregate Belgian Wind Farms and the Belgium region over the period 01/06/2021-30/06/2021, multiplied by appropriate scaling factors to match the capacity of the distribution system. The marginal emission factor of electricity is taken as 0.8953 t/MWh with reference to the 2019 baseline emission factors of Chinese regional power grids for emission reduction projects; the carbon price is taken as 91.38 per tonne based on the average traded price on the European Climate Exchange during 21/02/2022-23/02/2022, converted at an exchange rate of 6.99.
In the experiments, the simulation step length is 1 h; the learning rates of the Actor and Critic networks and the update weight of the Target networks are set to 10^-5, 3.0 × 10^-5 and 10^-3, respectively; the discount factor γ is 0.98; the batch size is set to 128; and the replay buffer size is set to 10^5.
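For reference, the reported settings (together with the exploration-noise variance and the test interval mentioned in the following paragraphs) can be gathered into a single configuration dictionary; the key names are an organisational choice, not part of the patent.

```python
TD3_CONFIG = {
    "sim_step_hours": 1,            # simulation step length
    "actor_lr": 1e-5,               # Actor network learning rate
    "critic_lr": 3.0e-5,            # Critic network learning rate
    "target_update_weight": 1e-3,   # update weight of the Target networks
    "gamma": 0.98,                  # discount factor
    "batch_size": 128,
    "buffer_size": int(1e5),        # replay buffer capacity
    "exploration_noise_var": 0.09,  # N(0, 0.09) exploration noise (see below)
    "eval_every_rounds": 10,        # noise-free scheduling test every 10 training rounds
}
```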
The experiments were performed based on Python and the TensorFlow 2.0 framework, using a computer equipped with an Intel Core i5-9300H CPU @ 2.40 GHz and one NVIDIA GTX 1660 Ti GPU.
The joint scheduling model was trained on one week of simulation and converged after 5000 training rounds. As shown in fig. 5, in the initial stage of training the control strategy changes drastically, and the carbon emissions and the various costs of the system fluctuate strongly. After about 2000 training rounds the model gradually stabilizes and begins to converge; the carbon emissions of the system drop to about 20 t and the total operating cost (converted into carbon-emission-equivalent terms) is about 21 t, which, compared with the ideal carbon emissions and operating cost of 15.3 t and 17.2 t in fig. 5, still differs considerably. The reason for this phenomenon is that the TD3 algorithm has to introduce noise for policy exploration during training; the noise used in the present invention is N(0, 0.09), which causes considerable interference to the policy output by the agent even after its performance has stabilized. This also explains why, at around 1000 rounds, the model briefly attains a low total cost and low carbon emissions that are not matched even later in training. Nevertheless, as can still be seen from fig. 6, as the training rounds increase the model consistently executes an excellent scheduling control strategy, demonstrating the convergence and robustness of the model's performance.
The data used in fig. 6 come from the test rounds during training (i.e. scheduling control tests performed without added noise and without learning, carried out every 10 training rounds). Meanwhile, in order to verify the performance of the TD3 algorithm, a simulation experiment was carried out under the same random-seed conditions using the DDPG algorithm as the comparison algorithm, and the resulting data are also plotted in fig. 6. The scheduling control strategy generated by the TD3 algorithm is stable: the carbon emission and total cost curves of the system decrease steadily with little overall fluctuation. The stability of the control strategy obtained by the DDPG algorithm is mediocre; the carbon emissions and total cost of the system fluctuate over a wide range during training, and the DDPG algorithm also has difficulty converging to the optimal scheduling strategy. Furthermore, as the training rounds increase, the model based on the TD3 algorithm reduces the final carbon emissions of the system to about 14.8 t and the total cost to about 16.2 t, an effective reduction in carbon emissions (or total cost) compared with the conservative control strategy of fig. 5. In addition, in fig. 5 the frequency regulation cost and the total running cost of the system remain roughly constant, indicating that the benefit obtainable through the reactive compensation equipment alone is very limited. This shows that the roughly 6% relative cost saving in fig. 6 is achieved by the trained agent reasonably scheduling and planning the electric vehicles and energy storage devices according to the operating condition of the system, and also shows that electric vehicle participation in scheduling can effectively improve the operating performance of the system.
FIG. 7 shows the penalty values incurred by improper behaviour of the agent during training. It can be clearly seen from the figure that, through continuous learning and interaction, the number of erroneous decisions issued by the agent decreases steadily and finally approaches zero, indicating that the agent can effectively identify the SoC of the vehicles and of the energy storage battery and, on that basis, avoid decisions harmful to battery life and to the operating state of the system.
FIG. 8 shows the number of voltage limit violations that occur during training. When large numbers of electric vehicles are connected during peak periods, they impose a considerable impact on the power system and may cause system voltage violations. Since the invention aims to reduce the carbon emissions of the system while ensuring economy, the agent distributes the electric vehicles that arrive in bulk at peak times across the individual time periods as far as possible for orderly charging and discharging, which optimizes system operation and effectively avoids the voltage violations caused by the concentrated connection of a large amount of load. As shown in the figure, under the conservative control strategy and during roughly the first 2500 training rounds, the system always experiences two voltage violations in one week of scheduling because of the concentrated connection of electric vehicles and the imperfect scheduling strategy; through continuous learning, however, the frequency of voltage violations decreases steadily, which demonstrates the effectiveness of the method and model in optimizing system operation.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (2)

1. An electric vehicle low-carbon charge-discharge scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
s1: acquiring state information s corresponding to the current time t every time a decision period passes t ={g t ,e t ,b t ,v t ,t},g t Representing the basic information of the power system at the current moment t; e, e t Information representing an electric vehicle; b t Information representative of the energy storage device; v t Information representative of reactive compensation equipment;
furthermore, the state information s t ={g t ,e t ,b t ,v t In t }, e t ={e is ,e soc ,e tar ,e r ,e cap ,e n ,e ch ,e dis H }, wherein e is Representing whether the vehicle is connected; e, e soc Representing a current SoC level of the vehicle; e, e tar Representing a target amount of power that the vehicle should reach at least at the end of charging; e, e r Representing the remaining charge time; e, e cap Representing a vehicle battery capacity; e, e n Representing the position of the charging pile; e, e ch Representing vehicle charging power; e, e dis Representing the charge-vehicle discharge power; η represents the vehicle charge/discharge efficiency;
the state information s t ={g t ,e t ,b t ,v t In t, b t ={b soc ,b up ,b low ,b p ,b n ,h b And b is }, where soc Representing a current SoC level of the energy storage device; b up And b low Respectively representing the upper limit and the lower limit of the SoC allowed by the energy storage equipment; b p Representing the current operating power of the energy storage device; b n Representing the number of the energy storage device; η (eta) b Representing the charge and discharge efficiency of the energy storage device;
the corresponding economic costs allocated by the energy storage equipment in the operation process are respectively as follows:
m b =δ b |△P b
w b =α b |P b |
wherein m is b And w b The power adjustment cost and the operation cost of the energy storage equipment are respectively; delta b And alpha b The unit power adjustment cost and the unit power operation cost of the energy storage equipment are respectively;
the state information s t ={g t ,e t ,b t ,v t In t, v t ={v p ,v n }, where v p Representing the current operating power of the reactive compensation equipment v n Representing the reactive compensation equipment number;
the corresponding economic costs allocated by the energy storage equipment in the operation process are respectively as follows:
m v =δ v |△Q v |
w v =α v |Q v |
wherein m is v And w v The power adjustment cost and the operation cost of the reactive compensation equipment are respectively; delta v And alpha v The unit power adjustment cost and the unit power operation cost of the reactive power compensation equipment are respectively;
s2: status information s to be acquired t ={g t ,e t ,b t ,v t T } is input into the trained joint scheduling model, and a corresponding scheduling strategy is output; the joint scheduling model is configured into a Markov decision process aiming at the minimum carbon emission, and trains an agent based on a TD3 algorithm;
moreover, with the aim of minimizing the carbon emission, the objective function thereof is configured to:
Figure FDA0004228619630000021
wherein m is n The action cost of each device; w (w) n For the operation of the devicesCost; h is a n The system is a penalty function and is used for punishing the misoperation of the intelligent agent; c t Is the carbon emission quantity, P, of the thermal power unit in the running process under a certain load G The power of the unit is; f (F) G The coal consumption rate of the unit; ρ is a carbon conversion coefficient; b G Is a marginal discharge factor of electric quantity;
when training an agent based on the TD3 algorithm, the objective function is modified into the following steps based on the characteristics of the TD3 algorithm:
Figure FDA0004228619630000022
wherein, beta represents action strategies adopted by the intelligent agent in the process of exploring the environment in the training process; ρ β Representing the action distribution under policy μ; q (Q) μ (s, μ (s)) represents the action value that can be generated in the state s when an action is selected according to the policy μ; j (J) β (μ) represents the expected return that can be obtained when an action is selected according to policy μ;
in the process of training the agent based on the TD3 algorithm, the algorithm alternately carries out strategy evaluation and strategy improvement; wherein, the liquid crystal display device comprises a liquid crystal display device,
in the policy evaluation phase, the state-action value, i.e. Q, needs to be calculated μ (s, μ (s)), the Q function can be expressed by the Bellman equation:
Q μ (s t ,a t )=E[r t +γQ μ (s t+1 ,a t+1 )]
after parameterizing the Q function with a neural network, the Q function is approximated by minimizing the Bellman residual:
Figure FDA0004228619630000031
wherein θ Q 、θ Q Parameters representing the Q network and the Target Q network, respectively;
in the strategy improvement stage, after the Q function is parameterized by the neural network, the objective function J is minimized β (mu) for updatingGradient of network parameters, namely:
Figure FDA0004228619630000032
and, further, performing Target value evaluation through two Target Q networks with different initial parameters, and selecting a smaller value as the Target value, so that the minimum Bellman residual error correction is as follows:
Figure FDA0004228619630000033
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure FDA0004228619630000034
is a noisy motion; />
Figure FDA0004228619630000035
And->
Figure FDA0004228619630000036
Parameters for two different Target Q networks;
and smoothing regularization using the target strategy to enhance stability of the strategy and smooth the Q-function; i.e. in the next state s when calculating the Bellman residual t+1 Action a taken t+1 Will be chosen as:
Figure FDA0004228619630000037
wherein μ represents the Target policy network; epsilon is added noise, typically selected to be gaussian noise, and its amplitude is sheared to limit to a small extent;
further, the agent output parameter a is configured μ And a std And takes this as the input of the Q network, the actual action a' te The selection is as follows:
Figure FDA0004228619630000041
z=tanh(a μ +a std ·ε)
Figure FDA0004228619630000042
s3: and executing corresponding control actions according to the scheduling strategy to control the charge and discharge states of the electric vehicles connected with the corresponding charge piles and the equipment states of the corresponding energy storage equipment and reactive compensation equipment in the power system.
2. The deep-reinforcement-learning-based low-carbon charge-discharge scheduling method for electric vehicles according to claim 1, wherein the penalty function h_n comprises: the penalty function h_vec corresponding to the electric vehicles and the penalty function h_b corresponding to the energy storage device; wherein,

h_vec = ||(z − a_em)||²

h_b = ||(a_tb − a_bm)||²

where a_em is an auxiliary vector for the vehicle actions, equal to the vector obtained by clipping z according to the boundary conditions; and a_bm is an auxiliary vector for the energy storage battery action, equal to the vector obtained by clipping a_tb according to the capacity boundary.
CN202211225223.9A 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning Active CN115663793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211225223.9A CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211225223.9A CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115663793A CN115663793A (en) 2023-01-31
CN115663793B true CN115663793B (en) 2023-06-23

Family

ID=84985454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211225223.9A Active CN115663793B (en) 2022-10-09 2022-10-09 Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115663793B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860287B (en) * 2023-03-02 2023-05-12 东方电气集团科学技术研究院有限公司 Low-carbon economical dispatching method for energy storage and generator set

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106159360A (en) * 2016-06-28 2016-11-23 合肥工业大学 A kind of charging electric vehicle method based on mobile charger pattern
CN111934335A (en) * 2020-08-18 2020-11-13 华北电力大学 Cluster electric vehicle charging behavior optimization method based on deep reinforcement learning
CN113515884A (en) * 2021-04-19 2021-10-19 国网上海市电力公司 Distributed electric vehicle real-time optimization scheduling method, system, terminal and medium
CN114971251A (en) * 2022-05-17 2022-08-30 南京逸刻畅行科技有限公司 Shared electric vehicle dispatching method based on deep reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7761389B2 (en) * 2007-08-23 2010-07-20 Gm Global Technology Operations, Inc. Method for anomaly prediction of battery parasitic load
US11731526B2 (en) * 2020-06-22 2023-08-22 Volta Charging, Llc Systems and methods for identifying characteristics of electric vehicles

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106159360A (en) * 2016-06-28 2016-11-23 合肥工业大学 A kind of charging electric vehicle method based on mobile charger pattern
CN111934335A (en) * 2020-08-18 2020-11-13 华北电力大学 Cluster electric vehicle charging behavior optimization method based on deep reinforcement learning
CN113515884A (en) * 2021-04-19 2021-10-19 国网上海市电力公司 Distributed electric vehicle real-time optimization scheduling method, system, terminal and medium
CN114971251A (en) * 2022-05-17 2022-08-30 南京逸刻畅行科技有限公司 Shared electric vehicle dispatching method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on intelligent car-following control and energy management strategy of hybrid electric vehicles based on deep reinforcement learning; Tang Xiaolin et al.; Journal of Mechanical Engineering; Vol. 57, No. 22; 237-246 *

Also Published As

Publication number Publication date
CN115663793A (en) 2023-01-31

Similar Documents

Publication Publication Date Title
Wu et al. Deep learning adaptive dynamic programming for real time energy management and control strategy of micro-grid
CN112186743B (en) Dynamic power system economic dispatching method based on deep reinforcement learning
CN110365056B (en) Distributed energy participation power distribution network voltage regulation optimization method based on DDPG
US11326579B2 (en) Adaptive dynamic planning control method and system for energy storage station, and storage medium
CN108429288B (en) Off-grid type microgrid energy storage optimization configuration method considering demand response
CN109765787B (en) Power distribution network source load rapid tracking method based on intraday-real-time rolling control
Chen et al. Intelligent energy scheduling in renewable integrated microgrid with bidirectional electricity-to-hydrogen conversion
CN110956324B (en) Day-ahead high-dimensional target optimization scheduling method for active power distribution network based on improved MOEA/D
CN115663793B (en) Electric automobile low-carbon charge-discharge scheduling method based on deep reinforcement learning
CN113408962A (en) Power grid multi-time scale and multi-target energy optimal scheduling method
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN112952847A (en) Multi-region active power distribution system peak regulation optimization method considering electricity demand elasticity
CN115313380A (en) New energy hydrogen production system coordination control method adaptive to hydrogen load fluctuation
CN115115130A (en) Wind-solar energy storage hydrogen production system day-ahead scheduling method based on simulated annealing algorithm
CN106786702A (en) Full range modeling for mixed energy storage system predicts energy dispatching method
KR20230070779A (en) Demand response management method for discrete industrial manufacturing system based on constrained reinforcement learning
CN117060470B (en) Power distribution network voltage optimization control method based on flexible resources
CN105119285A (en) Wind power storage coordination multi-objective optimization control method based on dynamic weighting
CN116436019B (en) Multi-resource coordination optimization method, device and storage medium
Chen et al. Deep reinforcement learning based research on low‐carbon scheduling with distribution network schedulable resources
Zhao et al. A data‐driven scheduling approach for integrated electricity‐hydrogen system based on improved DDPG
CN115841216A (en) Distribution network energy storage optimization configuration method considering distributed photovoltaic absorption rate
CN115940284A (en) Operation control strategy of new energy hydrogen production system considering time-of-use electricity price
Liu et al. Renewable energy utilizing and fluctuation stabilizing using optimal dynamic grid connection factor strategy and artificial intelligence-based solution method
Tongyu et al. Based on deep reinforcement learning algorithm, energy storage optimization and loss reduction strategy for distribution network with high proportion of distributed generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Chen Shi

Inventor after: Liu Yihong

Inventor after: Guo Zhengwei

Inventor after: Zhu Yabin

Inventor after: Yang Linsen

Inventor before: Chen Shi

Inventor before: Guo Zhengwei

Inventor before: Zhu Yabin

Inventor before: Yang Linsen

Inventor before: Liu Yihong

CB03 Change of inventor or designer information