CN117595346B - Charge-discharge strategy network training method and energy storage control method based on reinforcement learning - Google Patents

Charge-discharge strategy network training method and energy storage control method based on reinforcement learning

Info

Publication number
CN117595346B
CN117595346B (Application CN202410072211.XA)
Authority
CN
China
Prior art keywords
charge
time period
energy storage
storage battery
discharge
Prior art date
Legal status
Active
Application number
CN202410072211.XA
Other languages
Chinese (zh)
Other versions
CN117595346A (en)
Inventor
那琼澜
李信
邢宁哲
王艺霏
陈重韬
邬小波
曹良晶
马跃
彭柏
杨峰
娄竞
王东升
李坚
吴佳
李莉
张海明
Current Assignee
State Grid Corp of China SGCC
State Grid Jibei Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Jibei Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Jibei Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202410072211.XA priority Critical patent/CN117595346B/en
Publication of CN117595346A publication Critical patent/CN117595346A/en
Application granted granted Critical
Publication of CN117595346B publication Critical patent/CN117595346B/en

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/28Arrangements for balancing of the load in a network by storage of energy
    • H02J3/32Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J7/00Circuit arrangements for charging or depolarising batteries or for supplying loads from batteries
    • H02J7/0068Battery or charger load switching, e.g. concurrent charging and load supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]


Abstract

The embodiments of this specification provide a reinforcement-learning-based charge-discharge strategy network training method and an energy storage control method. The training method comprises: constructing a charge-discharge sequential decision model; acquiring the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period; determining the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model; calculating the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward, a degradation reward and a load balancing reward; and training the model with the reward of the k-th time period until training is completed to obtain the charge-discharge strategy network. The method constructs a charge-discharge sequential decision model based on reinforcement learning and designs a reward function that accounts for the performance degradation of the energy storage battery, thereby fully exploiting the peak-clipping and valley-filling capability of the battery while reducing battery capacity loss and energy loss.

Description

Charge-discharge strategy network training method and energy storage control method based on reinforcement learning
Technical Field
Embodiments of this specification relate to the technical field of energy storage management and control, and in particular to a reinforcement-learning-based charge-discharge strategy network training method and an energy storage control method.
Background
As the structure of the power supply side is continuously adjusted, the proportion of renewable energy connected to the grid keeps rising, which further widens the peak-valley load difference and aggravates the mismatch between power supply and power demand. User-side energy storage technology has therefore been proposed: by storing surplus power on the user side and releasing it when needed, energy can be used efficiently.
User-side energy storage is realized by an energy storage battery, whose capacity and performance degrade under sustained high-power charging and discharging. Existing methods consider neither this performance degradation nor real-time charge-discharge control of the user-side energy storage device under a time-varying user power load, so the operation control of user-side energy storage still needs to be optimized.
Disclosure of Invention
In view of the problems in the prior art, an object of the embodiments of this specification is to provide a reinforcement-learning-based charge-discharge strategy network training method and an energy storage control method, so as to solve the inaccurate user-side energy storage control and wasted energy that result from ignoring battery performance degradation and the time-varying user power load.
In order to solve the above technical problems, the specific technical solutions of the embodiments of the present specification are as follows:
in a first aspect, embodiments of the present disclosure provide a charge-discharge strategy network training method based on reinforcement learning, including:
constructing a charge-discharge sequential decision model of the energy storage battery at the user side;
acquiring the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period;
determining the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model;
calculating the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward;
by using the firstkTraining the charge-discharge sequential decision model for a rewarding iteration of a time period untilAnd (5) training to obtain a charge-discharge strategy network.
In particular, iteratively training the charge-discharge sequential decision model with the reward of the k-th time period until training is completed to obtain the charge-discharge strategy network comprises the following steps:
training the charge-discharge sequential decision model with the reward of the k-th time period to obtain the state of the (k+1)-th time period;
determining the charge-discharge power action of the energy storage battery in the (k+1)-th time period according to the state of the (k+1)-th time period and the charge-discharge sequential decision model;
calculating the reward of the (k+1)-th time period according to the charge-discharge power action of the energy storage battery in the (k+1)-th time period and the preset reward function;
judging whether the reward of the (k+1)-th time period satisfies a predetermined condition;
if yes, outputting the charge-discharge sequential decision model as the charge-discharge strategy network;
and if not, repeating the steps to iteratively update the charge-discharge sequential decision model.
Preferably, the reward function is:
r_k = ω1·r_b(k) + ω2·r_s(k) + ω3·r_a(k)
where r_k is the reward of the k-th time period; ω1, ω2 and ω3 are weight factors; r_b(k) is the benefit reward of the k-th time period; r_s(k) is the load balancing reward of the k-th time period; and r_a(k) is the energy storage battery degradation reward of the k-th time period;
the benefit reward is:
r_b(k) = -Σ_{i=1}^{N} e_i·(P_demand(i) - P_b(i))·T_d when k = N, and r_b(k) = 0 otherwise
where P_demand(i) is the user power load of the i-th time period; T_d is the time interval of each time period; e_i is the electricity unit price of the i-th time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period; and N is the number of time periods;
the load balancing reward is:
r_s(k) = P_b(k) - P_b(k-1)
where P_b(k) and P_b(k-1) are the charge-discharge powers of the energy storage battery in the k-th and (k-1)-th time periods, respectively;
the energy storage battery degradation reward is:
r_a(k) = -M·exp((-E_a + η·C_rate)/(R·T))·A(C_rate)^σ
where M is a pre-exponential factor; E_a is the activation energy; η is a coefficient of the charge-discharge rate; C_rate is the charge-discharge rate; σ is a power exponent; R is the universal gas constant; T is the absolute temperature; and A(C_rate) is the ampere-hour throughput.
Specifically, the ampere-hour throughput is:
A(C_rate) = (1/Q_b)·∫_0^{t_k} |i(τ)| dτ
where t_k is the end time of the k-th time period; τ denotes time; i(τ) is the current at time τ; and Q_b is the rated capacity of the energy storage battery.
Specifically, the state of charge of the energy storage battery in the k-th time period is obtained from the following charge-discharge model:
SOC(k) = SOC_0 - (T_d/(V_b·Q_b))·Σ_{i=1}^{k} P_b(i)
where SOC(k) is the state of charge of the energy storage battery in the k-th time period; SOC_0 is the state of charge of the energy storage battery in the initial time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period, with P_b(i) > 0 when the battery discharges and P_b(i) < 0 when it charges; T_d is the time interval of each time period; V_b is the open-circuit voltage of the energy storage battery; and Q_b is the rated capacity of the energy storage battery.
Further, the charge-discharge power action of the user-side energy storage battery in the k-th time period lies in an action set, the action set being:
A = { P_b^j = -P_ch^max + (j-1)·(P_ch^max + P_dis^max)/(N_a - 1) | j = 1, …, N_a }
where A is the action set; P_b^j is the j-th candidate charge-discharge power of the energy storage battery; P_ch^max and P_dis^max are the maximum charging power and the maximum discharging power, respectively; and N_a is the number of actions in the action set.
Further, judging whether the reward of the (k+1)-th time period satisfies the predetermined condition is:
judging whether the reward of the (k+1)-th time period and the rewards of the other time periods in the sequence ending at the (k+1)-th time period lie within a preset difference range, and whether the reward of the (k+1)-th time period is greater than the rewards of those other time periods.
In a second aspect, embodiments of this specification provide an energy storage control method that applies the charge-discharge strategy network trained by the reinforcement-learning-based charge-discharge strategy network training method of the above technical solution, the energy storage control method including:
acquiring electricity utilization unit price, user power load and state of charge of an energy storage battery in a current time period;
and inputting the electricity unit price, the user power load and the charge state of the energy storage battery in the current time period into the charge-discharge strategy network to obtain the charge-discharge power of the energy storage battery at the user side in the current time period output by the charge-discharge strategy network.
In a third aspect, embodiments of the present disclosure provide a charge-discharge strategy network training device based on reinforcement learning, including:
the model construction module is used for constructing a charge-discharge sequential decision model of the user side energy storage battery;
a state acquisition module, configured to acquire the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period of the charge-discharge sequential decision model;
a charge-discharge power action determining module, configured to determine the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model;
a reward calculation module, configured to calculate the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward;
and a training module, configured to iteratively train the charge-discharge sequential decision model with the reward of the k-th time period until training is completed, to obtain the charge-discharge strategy network.
In a fourth aspect, embodiments of the present disclosure provide a computer device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the reinforcement learning-based charge-discharge strategy network training method or the energy storage control method provided in the foregoing technical solution when executing the computer program.
With the above technical solutions, the reinforcement-learning-based charge-discharge strategy network training method and the energy storage control method provided by the embodiments of this specification construct a charge-discharge sequential decision model of the user-side energy storage battery based on reinforcement learning, and design a benefit reward, a load balancing reward and an energy storage battery degradation reward that account for the performance degradation of the user-side battery during operation. The peak-clipping and valley-filling capability of user-side energy storage is thus fully utilized, reducing the variance of the electricity load and the battery capacity loss, improving power supply economy, and reducing energy loss.
The foregoing and other objects, features and advantages of the embodiments of the invention will be apparent from the following more particular description of the preferred embodiments, as illustrated in the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic step diagram of a charge-discharge strategy network training method based on reinforcement learning according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of the steps of iterative training of a charge-discharge sequential decision model;
FIG. 3 is a schematic diagram illustrating steps of an energy storage control method according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a charge-discharge strategy network training device based on reinforcement learning according to an embodiment of the present disclosure;
fig. 5 shows a schematic structural diagram of an energy storage control device according to an embodiment of the present disclosure;
fig. 6 shows a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Description of the drawings:
41. a model building module;
42. a state acquisition module;
43. a charge-discharge power action determining module;
44. a reward calculation module;
45. a training module;
51. an acquisition module;
52. a charge-discharge power obtaining module;
602. a computer device;
604. a processor;
606. a memory;
608. a driving mechanism;
610. an input/output module;
612. an input device;
614. an output device;
616. a presentation device;
618. a graphical user interface;
620. a network interface;
622. a communication link;
624. a communication bus.
Detailed Description
The technical solutions of the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and the claims, and in the foregoing figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present description described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
The embodiments of this specification provide a reinforcement-learning-based charge-discharge strategy network training method and an energy storage control method, which can solve the inaccurate user-side energy storage control and wasted energy caused by ignoring battery performance degradation and the time-varying user power load in the prior art. Fig. 1 is a schematic diagram of the steps of the reinforcement-learning-based charge-discharge strategy network training method provided in an embodiment of this specification; it shows the operation steps described in the examples or flowcharts, but more or fewer steps may be included through routine or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent a unique order; when an actual system or apparatus product executes, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or drawings. Specifically, as shown in fig. 1, the reinforcement-learning-based charge-discharge strategy network training method may include:
s110: and constructing a charge-discharge sequential decision model of the user side energy storage battery.
S120: acquiring the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period.
Denote the electricity unit price of the k-th time period as e_k, which is preset; denote the user power load as P_demand(k), which is determined by user-side demand; and denote the state of charge of the energy storage battery as SOC(k), which is obtained from the following charge-discharge model:
SOC(k) = SOC_0 - (T_d/(V_b·Q_b))·Σ_{i=1}^{k} P_b(i)
where SOC(k) is the state of charge of the energy storage battery in the k-th time period; SOC_0 is the state of charge in the initial time period, for which a fixed time of day (e.g., midnight) may be designated as the initial time; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period, with P_b(i) > 0 when the battery discharges and P_b(i) < 0 when it charges; T_d is the time interval of each time period; V_b is the open-circuit voltage of the energy storage battery; and Q_b is the rated capacity of the energy storage battery.
By establishing the charge-discharge model of the energy storage battery according to the above formula, the state of charge in each time period can be calculated more accurately.
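As an illustration of this bookkeeping, a minimal Python sketch follows; the function layout and units are our assumptions, while the sign convention (discharge positive, charge negative) and the symbols SOC_0, T_d, V_b and Q_b follow the text above.

```python
def state_of_charge(soc_0, p_b, t_d, v_b, q_b):
    """State of charge after the k-th time period (illustrative sketch).

    soc_0: state of charge of the initial time period (fraction, 0..1)
    p_b:   sequence of charge-discharge powers P_b(1..k) in watts,
           discharge positive and charge negative, per the convention above
    t_d:   time interval of each period, in seconds
    v_b:   open-circuit voltage of the battery, in volts
    q_b:   rated capacity of the battery, in coulombs (ampere-seconds)
    """
    # Each period transfers P_b(i) * T_d joules; dividing by V_b * Q_b
    # (the battery's nominal energy) converts that to a SOC fraction.
    return soc_0 - sum(p * t_d for p in p_b) / (v_b * q_b)
```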
S130: determining the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model.
S140: calculating the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward.
S150: iteratively training the charge-discharge sequential decision model with the reward of the k-th time period until training is completed, to obtain the charge-discharge strategy network.
The reinforcement-learning-based charge-discharge strategy network training method provided by the embodiments of this specification accounts for the performance degradation of the user-side energy storage battery during operation, constructs a charge-discharge sequential decision model of the user-side battery based on reinforcement learning, and designs a benefit reward, a load balancing reward and an energy storage battery degradation reward. The peak-clipping and valley-filling capability of user-side energy storage is thereby fully utilized: the battery is charged during periods of low electricity consumption and low price, and discharged during periods of peak consumption and high price, reducing the electricity load variance and battery capacity loss, improving power supply economy, and reducing energy loss.
In particular, the sequential decision process may be expressed as:
(…, s_k, a_k, r_k, s_{k+1}, a_{k+1}, r_{k+1}, …)
where s_k, a_k and r_k are the state, action and reward of the energy storage battery for the k-th time period, and s_{k+1}, a_{k+1} and r_{k+1} are the state, action and reward of the energy storage battery for the (k+1)-th time period.
That is, as shown in fig. 2, step S150 of iteratively training the charge-discharge sequential decision model with the reward of the k-th time period until training is completed to obtain the charge-discharge strategy network specifically includes:
S210: updating the state of the energy storage battery for the (k+1)-th time period according to the reward of the k-th time period.
That is, the state s_{k+1} of the energy storage battery is obtained by updating with the reward r_k.
S220: determining the charge-discharge power action of the energy storage battery in the (k+1)-th time period according to the state of the (k+1)-th time period and the charge-discharge sequential decision model.
The updated state s_{k+1} is input into the charge-discharge sequential decision model to obtain the charge-discharge power action a_{k+1} taken by the energy storage battery in the (k+1)-th time period.
S230: calculating the reward of the (k+1)-th time period according to the charge-discharge power action of the energy storage battery in the (k+1)-th time period and the preset reward function.
That is, r_{k+1} is calculated from the charge-discharge power action a_{k+1} and the preset reward function.
S240: judging whether the reward of the (k+1)-th time period satisfies a predetermined condition.
In some specific embodiments, judging whether the reward of the (k+1)-th time period satisfies the predetermined condition is:
judging whether the reward of the (k+1)-th time period and the rewards of the other time periods in the sequence ending at the (k+1)-th time period lie within a preset difference range, and whether the reward of the (k+1)-th time period is greater than the rewards of those other time periods; that is, judging whether the reward of the charge-discharge sequential decision model has become maximal and stable after several iterations of reinforcement learning.
S250: if yes, outputting the charge-discharge sequential decision model as the charge-discharge strategy network.
Illustratively, it is determined whether the reward of the n-th time period among n sequentially adjacent time periods is the maximum of the n rewards, and whether the differences among the n rewards lie within a preset difference threshold range. If so, the rewards have stabilized and the charge-discharge sequential decision model is trained. If the model is iterated once more and the reward of the (n+1)-th time period is obtained, and this reward is smaller than that of the n-th time period but the two still lie within the preset difference threshold range, the charge-discharge sequential decision model corresponding to the n-th time period may be output as the charge-discharge strategy network.
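This stopping rule can be sketched as below, assuming a sliding window of the most recent rewards and a preset difference threshold; the window length and the threshold are illustrative, as the source gives no numerical values.

```python
def rewards_converged(rewards, window, eps):
    """Stopping-rule sketch: the last `window` rewards must lie within
    `eps` of one another, and the most recent reward must be no smaller
    than any other reward in the window (maximal and stable)."""
    if len(rewards) < window:
        return False
    recent = rewards[-window:]
    stable = max(recent) - min(recent) <= eps
    return stable and recent[-1] == max(recent)
```

For example, `rewards_converged(episode_rewards, window=10, eps=0.5)` checks the last ten rewards against a 0.5 spread; both values are placeholders, not parameters from the patent.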
If not, the method jumps to step S210 and repeats the above steps to iteratively update the charge-discharge sequential decision model.
This indicates that the charge-discharge sequential decision model is not yet trained: the state s_{k+2} is obtained by updating with the reward r_{k+1}; the state s_{k+2} is input into the charge-discharge sequential decision model again; the charge-discharge power action a_{k+2} taken by the energy storage battery in the (k+2)-th time period is determined; and the reward r_{k+2} is calculated. Whether the reward r_{k+2} satisfies the predetermined condition is then judged again, and the iteration continues until the update of the charge-discharge sequential decision model is completed.
It should be noted that, in the reinforcement learning process, when the reward is greater than a predetermined value (such as zero), it guides the update of the charge-discharge strategy network in the forward direction; when the reward is less than or equal to the predetermined value, it guides the update in the reverse direction, so that the learning of the charge-discharge strategy network is guided step by step.
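The source does not name a concrete reinforcement learning algorithm for this loop, so the sketch below instantiates it with tabular ε-greedy Q-learning purely for illustration; the `env.reset()`/`env.step()` interface, returning a hashable state and a (next state, reward, done) tuple, is likewise an assumption.

```python
import random

def train(env, actions, episodes, alpha=0.1, gamma=0.95, eps=0.1):
    """Illustrative training loop over the sequential decision process
    (..., s_k, a_k, r_k, s_{k+1}, ...). `env.reset()` is assumed to return
    a hashable state s_k = (P_demand, SOC, e); `env.step(P_b)` is assumed
    to return (next_state, reward, done)."""
    q = {}  # Q-table keyed by (state, action index)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy selection over the discrete action set A
            if random.random() < eps:
                a = random.randrange(len(actions))
            else:
                a = max(range(len(actions)), key=lambda i: q.get((s, i), 0.0))
            s_next, r, done = env.step(actions[a])
            # one-step temporal-difference update toward r + gamma * max Q(s', .)
            best_next = max(q.get((s_next, i), 0.0) for i in range(len(actions)))
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s_next
    return q
```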
In the present embodiment, the state of reinforcement learning may be expressed as:
s_k = {P_demand(k), SOC(k), e_k}
where s_k is the state of the k-th time period; P_demand(k) is the user power load of the k-th time period; SOC(k) is the state of charge of the energy storage battery in the k-th time period; and e_k is the electricity unit price of the k-th time period.
The action set of reinforcement learning is:
A = { P_b^j = -P_ch^max + (j-1)·(P_ch^max + P_dis^max)/(N_a - 1) | j = 1, …, N_a }
where A is the action set; P_b^j is the j-th candidate charge-discharge power of the energy storage battery; P_ch^max and P_dis^max are the maximum charging power and the maximum discharging power, respectively; and N_a is the number of actions in the action set.
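Under this convention the discrete action set can be generated as below; the uniform spacing between -P_ch^max and P_dis^max is an assumption, since the source fixes only the endpoints and the set size N_a.

```python
def build_action_set(p_ch_max, p_dis_max, n_a):
    """N_a evenly spaced charge-discharge power actions spanning
    [-P_ch_max, P_dis_max] (charge negative, discharge positive).
    Uniform spacing is assumed, not stated in the source."""
    step = (p_dis_max + p_ch_max) / (n_a - 1)
    return [-p_ch_max + j * step for j in range(n_a)]
```

For example, `build_action_set(5000.0, 5000.0, 11)` yields eleven actions from -5 kW to +5 kW in 1 kW steps, including the idle action 0.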
The rewards of reinforcement learning are:
r_k = ω1·r_b(k) + ω2·r_s(k) + ω3·r_a(k)
where r_k is the reward of the k-th time period; ω1, ω2 and ω3 are weight factors; r_b(k) is the benefit reward of the k-th time period; r_s(k) is the load balancing reward of the k-th time period; and r_a(k) is the energy storage battery degradation reward of the k-th time period.
In some preferred embodiments, the benefit reward may be:
r_b(k) = -Σ_{i=1}^{N} e_i·(P_demand(i) - P_b(i))·T_d when k = N, and r_b(k) = 0 otherwise
where P_demand(i) is the user power load of the i-th time period; T_d is the time interval of each time period; e_i is the electricity unit price of the i-th time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period; and N is the number of time periods. The time periods are obtained by uniformly dividing the 24 hours of a day; optionally, N may take a value of 24 to 48, i.e., the time interval T_d of each period may be 1 hour to 0.5 hour. Choosing a suitable number and duration of time periods balances the workload and the efficiency of iteratively updating the charge-discharge sequential decision model.
As can be seen from the expression of the benefit reward, the benefit reward is a terminal reward: it is nonzero only in the last time period of the day, and all earlier time periods contribute to the benefit reward of that last period. This reflects the lag of the benefit reward: the payoff of the charge-discharge power action taken in each of the first N-1 time periods is not realized immediately in the following period. Designing the benefit reward as in the above formula improves the accuracy of iteratively training the charge-discharge sequential decision model with the benefit.
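This terminal structure can be sketched as follows; the zero-based indexing and the sign convention (the day's electricity cost entered negatively, so that a cheaper day yields a larger reward) are our reading of the text, not explicit statements in it.

```python
def benefit_reward(k, n, e, p_demand, p_b, t_d):
    """Terminal benefit reward r_b(k): zero except in the last period N,
    where it is minus the day's total electricity cost (assumed sign).
    e, p_demand and p_b are sequences indexed 0..N-1 for periods 1..N."""
    if k != n:
        return 0.0
    # grid purchase in period i is P_demand(i) - P_b(i);
    # discharging (P_b > 0) reduces the energy bought at price e_i
    return -sum(e[i] * (p_demand[i] - p_b[i]) * t_d for i in range(n))
```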
The load balancing rewards are as follows:
r_s(k) = P_b(k) - P_b(k-1)
where P_b(k) and P_b(k-1) are the charge-discharge powers of the energy storage battery in the k-th and (k-1)-th time periods, respectively.
Designing the load balancing reward as in the above formula allows the peak-clipping and valley-filling capability of user-side energy storage to be fully exploited, so that the energy storage battery is charged during periods of low electricity consumption and low price and discharged during periods of high consumption and high price, reducing the electricity load variance.
Further, the energy storage battery degradation reward is:
r_a(k) = -M·exp((-E_a + η·C_rate)/(R·T))·A(C_rate)^σ
where M is a pre-exponential factor; E_a is the activation energy; η is a coefficient of the charge-discharge rate; C_rate is the charge-discharge rate; σ is a power exponent; R is the universal gas constant; T is the absolute temperature; and A(C_rate) is the ampere-hour throughput.
The ampere-hour throughput is:
A(C_rate) = (1/Q_b)·∫_0^{t_k} |i(τ)| dτ
where t_k is the end time of the k-th time period; τ denotes time; i(τ) is the current at time τ, which can be measured and is related to the charge-discharge power action of the energy storage battery in the corresponding time period; and Q_b is the rated capacity of the energy storage battery.
Designing the energy storage battery degradation reward as in the above formula accounts for the performance degradation of the battery, i.e., the reduction of battery capacity under sustained high-power charge-discharge operation, so that reinforcement learning can yield a charge-discharge control strategy that slows this degradation. In summary, by designing a reward function comprising a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward, the embodiments of this specification construct a more accurate charge-discharge sequential decision model of the user-side energy storage battery; the resulting charge-discharge strategy network controls the charge-discharge power of the battery in each time period more accurately, reducing battery capacity loss, improving power supply economy, and reducing energy waste.
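For completeness, the three reward components and their weighted sum can be sketched as follows. The Arrhenius-type form, with an activation energy E_a and a C-rate coefficient η that the source's symbol list does not name, follows the widely used semi-empirical battery cycle-life model and is therefore an assumption rather than the patent's exact formula.

```python
import math

R_GAS = 8.314  # universal gas constant R, in J/(mol*K)

def ah_throughput(currents, t_d, q_b):
    """Ampere-hour throughput A(C_rate) up to t_k: a discrete version of
    (1/Q_b) * integral of |i(tau)| d tau over [0, t_k]."""
    return sum(abs(i) for i in currents) * t_d / q_b

def degradation_reward(m, e_a, eta, c_rate, sigma, temp_k, a_cr):
    """Arrhenius-type capacity-loss penalty (assumed form; e_a and eta
    are not listed in the source). Returned negative so that more
    degradation lowers the reward."""
    q_loss = m * math.exp((-e_a + eta * c_rate) / (R_GAS * temp_k)) * a_cr ** sigma
    return -q_loss

def load_balance_reward(p_b_k, p_b_prev):
    """r_s(k) = P_b(k) - P_b(k-1), exactly as given in the text."""
    return p_b_k - p_b_prev

def total_reward(w, r_b, r_s, r_a):
    """r_k = w1*r_b(k) + w2*r_s(k) + w3*r_a(k), with w = (w1, w2, w3)."""
    return w[0] * r_b + w[1] * r_s + w[2] * r_a
```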
As shown in fig. 3, an embodiment of this specification further provides an energy storage control method that applies the charge-discharge strategy network trained by the above reinforcement-learning-based charge-discharge strategy network training method. The energy storage control method includes:
S310: acquiring the electricity unit price, the user power load and the state of charge of the energy storage battery in the current time period.
S320: inputting the electricity unit price, the user power load and the state of charge of the energy storage battery in the current time period into the charge-discharge strategy network, to obtain the charge-discharge power of the user-side energy storage battery in the current time period output by the charge-discharge strategy network.
The user power load, the state of charge of the energy storage battery and the electricity unit price in the current time period serve as the input of the charge-discharge strategy network, and the charge-discharge strategy network outputs the optimal charge-discharge power of the user-side energy storage device for the current time period.
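In deployment this is a single forward pass; the sketch below assumes a `policy` callable that returns one score per discrete action (e.g., Q-values or action preferences), which is an illustrative interface rather than anything specified in the source.

```python
def control_step(policy, actions, e_k, p_demand_k, soc_k):
    """One control step: feed the current state into the trained
    charge-discharge strategy network and return the power setpoint.
    `policy(state)` returning a score per action is an assumed interface."""
    state = (p_demand_k, soc_k, e_k)  # s_k = {P_demand(k), SOC(k), e_k}
    scores = policy(state)
    best = max(range(len(actions)), key=lambda i: scores[i])
    return actions[best]  # charge-discharge power for the current period
```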
Based on the above reinforcement-learning-based charge-discharge strategy network training method, the embodiments of this specification correspondingly provide a reinforcement-learning-based charge-discharge strategy network training device; based on the above energy storage control method, the embodiments of this specification further correspondingly provide an energy storage control device. The provided devices may include systems (including distributed systems), software (applications), modules, components, servers, clients, etc. that employ the methods described in the embodiments of this specification in combination with the necessary hardware. Based on the same innovative concept, the devices in one or more embodiments are as described in the following embodiments. Because the devices solve the problems in a manner similar to the methods, the implementation of the devices may refer to the implementation of the foregoing methods, and repeated description is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
As shown in fig. 4, a charge-discharge strategy network training device based on reinforcement learning provided in the embodiment of the present disclosure includes:
the model construction module 41 is used for constructing a charge-discharge sequential decision model of the user side energy storage battery;
a state acquisition module 42, configured to acquire the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period of the charge-discharge sequential decision model;
a charge-discharge power action determining module 43, configured to determine the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model;
a reward calculation module 44, configured to calculate the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward;
and a training module 45, configured to iteratively train the charge-discharge sequential decision model with the reward of the k-th time period until training is completed, to obtain the charge-discharge strategy network.
As shown in fig. 5, the energy storage control device provided in the embodiment of the present disclosure includes:
an obtaining module 51, configured to obtain a unit price of electricity, a power load of a user, and a state of charge of an energy storage battery in a current period;
the charge-discharge power obtaining module 52 is configured to input the electricity unit price, the user power load, and the state of charge of the energy storage battery to a policy network, so as to obtain the charge-discharge power of the user side energy storage battery in the current time period; the strategy network is obtained by training a charge-discharge sequential decision model of the energy storage battery at the user side based on a reinforcement learning method.
As shown in fig. 6, an embodiment of this specification further provides a computer device, on which the above reinforcement-learning-based charge-discharge strategy network training device or energy storage control device performs the methods of this specification. The computer device 602 may include one or more processors 604, such as one or more central processing units (CPUs), each of which may implement one or more hardware threads. The computer device 602 may also include any memory 606 for storing any kind of information, such as code, settings and data. For example, and without limitation, the memory 606 may include any one or more of the following: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may store information using any technique; it may provide volatile or non-volatile retention of information, and it may represent a fixed or removable component of the computer device 602. When the processor 604 executes associated instructions stored in any memory or combination of memories, the computer device 602 can perform any of the operations of those instructions. The computer device 602 also includes one or more drive mechanisms 608, such as a hard disk drive mechanism or an optical disk drive mechanism, for interacting with any memory.
The computer device 602 may also include an input/output module 610 (I/O) for receiving various inputs (via an input device 612) and for providing various outputs (via an output device 614). One particular output mechanism may include a presentation device 616 and an associated Graphical User Interface (GUI) 618. In other embodiments, input/output module 610 (I/O), input device 612, and output device 614 may not be included, but may be implemented as a single computer device in a network. The computer device 602 may also include one or more network interfaces 620 for exchanging data with other devices via one or more communication links 622. One or more communication buses 624 couple the above-described components together.
The communication link 622 may be implemented in any manner, for example, through a local area network, a wide area network (e.g., the internet), a point-to-point connection, etc., or any combination thereof. Communication link 622 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Corresponding to the methods shown in figs. 1 to 3, the present embodiments also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above methods.
The present embodiments also provide computer-readable instructions which, when executed by a processor, cause the processor to perform the methods shown in figs. 1 to 3.
The present specification also provides a computer program product comprising at least one instruction or at least one program, which is loaded and executed by a processor to implement the methods shown in figs. 1 to 3.
It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
It should also be understood that, in the embodiments of this specification, the term "and/or" merely describes an association between objects, meaning that three relationships may exist. For example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In this specification, the character "/" generally indicates that the associated objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the various example components and steps have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present specification.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present description.
In addition, each functional unit in each embodiment of the present specification may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present specification is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present specification. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The principles and embodiments of the present specification are explained in this specification using specific examples, the above examples being provided only to assist in understanding the method of the present specification and its core ideas; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope based on the ideas of the present specification, the present description should not be construed as limiting the present specification in view of the above.

Claims (9)

1. A charge-discharge strategy network training method based on reinforcement learning is characterized by comprising the following steps:
constructing a charge-discharge sequential decision model of the energy storage battery at the user side;
acquiring the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period;
determining the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model;
calculating the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward, the reward function being:
r_k = ω1·r_b(k) + ω2·r_s(k) + ω3·r_a(k)
where r_k is the reward of the k-th time period; ω1, ω2 and ω3 are weight factors; r_b(k) is the benefit reward of the k-th time period; r_s(k) is the load balancing reward of the k-th time period; and r_a(k) is the energy storage battery degradation reward of the k-th time period;
the benefit reward being:
r_b(k) = -Σ_{i=1}^{N} e_i·(P_demand(i) - P_b(i))·T_d when k = N, and r_b(k) = 0 otherwise
where P_demand(i) is the user power load of the i-th time period; T_d is the time interval of each time period; e_i is the electricity unit price of the i-th time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period; and N is the number of time periods;
the load balancing reward being:
r_s(k) = P_b(k) - P_b(k-1)
where P_b(k) and P_b(k-1) are the charge-discharge powers of the energy storage battery in the k-th and (k-1)-th time periods, respectively;
the energy storage battery degradation reward being:
r_a(k) = -M·exp((-E_a + η·C_rate)/(R·T))·A(C_rate)^σ
where M is a pre-exponential factor; E_a is the activation energy; η is a coefficient of the charge-discharge rate; C_rate is the charge-discharge rate; σ is a power exponent; R is the universal gas constant; T is the absolute temperature; and A(C_rate) is the ampere-hour throughput;
and iteratively training the charge-discharge sequential decision model with the reward of the k-th time period until training is completed, to obtain the charge-discharge strategy network.
2. The method according to claim 1, wherein iteratively training the charge-discharge sequential decision model with the reward of the k-th time period until training is completed to obtain the charge-discharge strategy network further comprises:
training the charge-discharge sequential decision model with the reward of the k-th time period to obtain the state of the (k+1)-th time period;
determining the charge-discharge power action of the energy storage battery in the (k+1)-th time period according to the state of the (k+1)-th time period and the charge-discharge sequential decision model;
calculating the reward of the (k+1)-th time period according to the charge-discharge power action of the energy storage battery in the (k+1)-th time period and the preset reward function;
judging whether the reward of the (k+1)-th time period satisfies a predetermined condition;
if yes, outputting the charge-discharge sequential decision model as the charge-discharge strategy network;
and if not, repeating the steps to iteratively update the charge-discharge sequential decision model.
3. The method of claim 1, wherein the ampere-hour throughput is:
A(C_rate) = (1/Q_b)·∫_0^{t_k} |i(τ)| dτ
where t_k is the end time of the k-th time period; τ denotes time; i(τ) is the current at time τ; and Q_b is the rated capacity of the energy storage battery.
4. The method according to claim 1, wherein the state of charge of the energy storage battery in the k-th time period is obtained through the following charge-discharge model:
SOC(k) = SOC_0 - (T_d/(V_b·Q_b))·Σ_{i=1}^{k} P_b(i)
where SOC(k) is the state of charge of the energy storage battery in the k-th time period; SOC_0 is the state of charge of the energy storage battery in the initial time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period, with P_b(i) > 0 when the battery discharges and P_b(i) < 0 when it charges; T_d is the time interval of each time period; V_b is the open-circuit voltage of the energy storage battery; and Q_b is the rated capacity of the energy storage battery.
5. The method according to claim 1, wherein the charge-discharge power action of the user-side energy storage battery in the k-th time period lies in an action set, the action set being:
A = { P_b^j = -P_ch^max + (j-1)·(P_ch^max + P_dis^max)/(N_a - 1) | j = 1, …, N_a }
where A is the action set; P_b^j is the j-th candidate charge-discharge power of the energy storage battery; P_ch^max and P_dis^max are the maximum charging power and the maximum discharging power, respectively; and N_a is the number of actions in the action set.
6. The method according to claim 2, wherein judging whether the reward of the (k+1)-th time period satisfies the predetermined condition further comprises:
judging whether the reward of the (k+1)-th time period and the rewards of the other time periods in the sequence ending at the (k+1)-th time period lie within a preset difference range, and whether the reward of the (k+1)-th time period is greater than the rewards of those other time periods.
7. An energy storage control method, characterized in that the energy storage control method applies the charge-discharge strategy network trained by the reinforcement learning-based charge-discharge strategy network training method according to any one of claims 1 to 6, and the energy storage control method comprises:
acquiring electricity utilization unit price, user power load and state of charge of an energy storage battery in a current time period;
and inputting the electricity unit price, the user power load and the charge state of the energy storage battery in the current time period into the charge-discharge strategy network to obtain the charge-discharge power of the energy storage battery at the user side in the current time period output by the charge-discharge strategy network.
8. A reinforcement-learning-based charge-discharge strategy network training device, comprising:
the model construction module is used for constructing a charge-discharge sequential decision model of the user side energy storage battery;
a state acquisition module, configured to acquire the electricity unit price, the user power load and the state of charge of the energy storage battery in the k-th time period as the state of the k-th time period of the charge-discharge sequential decision model;
a charge-discharge power action determining module, configured to determine the charge-discharge power action of the energy storage battery in the k-th time period according to the state of the k-th time period and the charge-discharge sequential decision model;
a reward calculation module, configured to calculate the reward of the k-th time period according to the charge-discharge power action of the k-th time period and a preset reward function, the reward function including a benefit reward considering the electricity unit price, a degradation reward considering capacity degradation of the energy storage battery, and a load balancing reward, the reward function being:
r_k = ω1·r_b(k) + ω2·r_s(k) + ω3·r_a(k)
where r_k is the reward of the k-th time period; ω1, ω2 and ω3 are weight factors; r_b(k) is the benefit reward of the k-th time period; r_s(k) is the load balancing reward of the k-th time period; and r_a(k) is the energy storage battery degradation reward of the k-th time period;
the benefit reward being:
r_b(k) = -Σ_{i=1}^{N} e_i·(P_demand(i) - P_b(i))·T_d when k = N, and r_b(k) = 0 otherwise
where P_demand(i) is the user power load of the i-th time period; T_d is the time interval of each time period; e_i is the electricity unit price of the i-th time period; P_b(i) is the charge-discharge power of the energy storage battery in the i-th time period; and N is the number of time periods;
the load balancing reward being:
r_s(k) = P_b(k) - P_b(k-1)
where P_b(k) and P_b(k-1) are the charge-discharge powers of the energy storage battery in the k-th and (k-1)-th time periods, respectively;
the energy storage battery degradation reward being:
r_a(k) = -M·exp((-E_a + η·C_rate)/(R·T))·A(C_rate)^σ
where M is a pre-exponential factor; E_a is the activation energy; η is a coefficient of the charge-discharge rate; C_rate is the charge-discharge rate; σ is a power exponent; R is the universal gas constant; T is the absolute temperature; and A(C_rate) is the ampere-hour throughput;
and a training module, configured to iteratively train the charge-discharge sequential decision model with the reward of the k-th time period until training is completed, to obtain the charge-discharge strategy network.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when the computer program is executed.
CN202410072211.XA 2024-01-18 2024-01-18 Charge-discharge strategy network training method and energy storage control method based on reinforcement learning Active CN117595346B (en)


Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410072211.XA CN117595346B (en) 2024-01-18 2024-01-18 Charge-discharge strategy network training method and energy storage control method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN117595346A CN117595346A (en) 2024-02-23
CN117595346B 2024-04-05

Family

ID=89910265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410072211.XA Active CN117595346B (en) 2024-01-18 2024-01-18 Charge-discharge strategy network training method and energy storage control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117595346B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114243797A (en) * 2021-12-15 2022-03-25 中国电力科学研究院有限公司 Distributed power supply optimal scheduling method, system, equipment and storage medium
CN115375015A (en) * 2022-08-09 2022-11-22 中国电力科学研究院有限公司 Multi-energy optimization method, system and medium based on multi-agent reinforcement learning
CN116014773A (en) * 2022-12-01 2023-04-25 清华大学深圳国际研究生院 Scheduling method for two-stage electric vehicle cooperative networking auxiliary service based on deep reinforcement learning
CN116247648A (en) * 2022-12-12 2023-06-09 国网浙江省电力有限公司经济技术研究院 Deep reinforcement learning method for micro-grid energy scheduling under consideration of source load uncertainty
CN116523228A (en) * 2023-04-24 2023-08-01 上海交通大学 Mobile energy network real-time energy management method and system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN117595346A (en) 2024-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant