CN115358528A - Battery energy storage capacity estimation method and system based on reinforcement learning algorithm - Google Patents
Battery energy storage capacity estimation method and system based on reinforcement learning algorithm
- Publication number: CN115358528A
- Application number: CN202210868577.9A
- Authority: CN (China)
- Prior art keywords: battery, energy storage, action, return, calculating
- Prior art date: 2022-07-22
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06Q10/06313: Resource planning in a project environment
- G06N20/00: Machine learning
- G06Q50/06: Energy or water supply
- H02J3/008: Circuit arrangements for AC mains or AC distribution networks involving trading of energy or energy transmission rights
- H02J3/32: Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
- H02J2203/10: Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
- H02J2203/20: Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
Abstract
The invention relates to a battery energy storage capacity estimation method and system based on a reinforcement learning algorithm. The method comprises the following steps: according to the strategy networks under different battery capacities, simulating the charging and discharging actions of the energy storage battery after it is added, obtaining an optimal strategy, and calculating the electricity charge saved each year; calculating the internal rate of return on investment for different battery capacities; within the range of acceptable internal rates of return, calculating the financial cost over the return-on-investment period under different battery capacities according to the loan interest rate; calculating the profit under different battery capacities according to the initial investment and the financial cost; and outputting the battery capacity with the highest profit as the final battery capacity selection scheme. The invention can assist decisions on energy storage schemes, give accurate calculation results based on historical data to support investment decisions, flexibly accommodate changes in various conditions while giving correspondingly updated accurate results, and, based on feature settings for electricity price fluctuation, give the calculation uncertainty and the investment return risk.
Description
Technical Field
The invention relates to the field of energy storage design of a power demand side, in particular to a battery energy storage capacity estimation method and system based on a reinforcement learning algorithm.
Background
With the reform of the power market, the future price of electricity will be determined by the competitive game between electricity generation and electricity consumption. Electricity prices will vary in real time by period and by zone (e.g., every 15 minutes), and their fluctuation will increase greatly. Fully exploiting fluctuating electricity prices to adjust electricity use within a building therefore helps the consumption side reduce its total electricity cost and helps the generation side relieve peak-load pressure. Adding demand-side energy storage can bring considerable flexibility to building electricity regulation, but the cost of energy storage batteries is currently not low, so the economic value of an energy storage design scheme must be accurately evaluated to support the retrofit decision. General energy storage design evaluation methods have poor accuracy, so the return on investment may be low or even negative. By fully utilizing a building's historical real-time operation data and historical real-time fluctuating electricity prices, the invention gives quantitative decision support for each building instance under different external conditions.
Disclosure of Invention
In order to overcome the problems in the related art, the invention provides a method and a system for estimating the energy storage capacity of a battery based on a reinforcement learning algorithm.
According to a first aspect of the embodiments of the present invention, a method for estimating a battery energy storage capacity based on a reinforcement learning algorithm is provided, including:
simulating the charging and discharging actions of the energy storage battery after the energy storage battery is added according to the strategy network under different battery capacities to obtain an optimal strategy, and calculating the electricity charge saved each year under the optimal strategy;
taking the annual saved electric charge as net cash flow, and calculating the internal rate of return on investment under different battery capacities;
calculating financial costs in the return on investment period for different battery capacities within an acceptable internal rate of return according to loan rates;
calculating profits under different battery capacities according to the initial investment and the financial cost;
and outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
Further, the method also comprises a step of obtaining the strategy network by training with the Q-Learning algorithm of reinforcement learning, which specifically comprises the following steps:
step 1, initializing algorithm parameters and a Q table;
step 2, inputting an environment state, and inquiring all actions which can be taken by the battery;
step 3, inquiring the Q value of each action which can be taken by the battery in the current state in a Q table, and selecting the action according to the inquired Q value;
step 4, calculating the reward according to the selected action;
step 5, updating the electricity price in the battery;
step 6, updating the Q table;
step 7, updating the energy storage state and the environment state of the battery;
and repeatedly executing the steps 2-7 until the algorithm converges.
Further, in step 3, selecting an action according to the queried Q value specifically includes:
if the Q values are all 0, giving the same probability to all actions which can be taken, and then selecting the action according to the probability; otherwise, the probability of the action with the highest Q value is increased, and then the action is selected according to the probability.
Further, the step 4 specifically includes:
in the discharging state of the battery, if the electricity price in the battery is higher than the electricity price of the power grid, negative reward is obtained through calculation, and if the electricity price in the battery is not higher than the electricity price of the power grid, positive reward is obtained through calculation;
under the charging state of the battery, if the electricity price in the battery is not more than the electricity price of the power grid, positive reward is obtained through calculation, otherwise, reward is not carried out;
in a state where the battery is neither charged nor discharged, no prize is awarded.
Further, the internal rate of return is calculated as:

NPV = \sum_{i=0}^{n} \frac{CF_i}{(1 + IRR)^i} = 0

where NPV is the net present value, CF_i is the net cash flow of the i-th year, n is the return-on-investment period, and IRR is the internal rate of return.
According to a second aspect of the embodiments of the present invention, there is provided a battery energy storage capacity estimation system based on a reinforcement learning algorithm, including:
the first calculation module is used for simulating the charging and discharging actions of the energy storage battery after the energy storage battery is added according to the strategy networks under different battery capacities to obtain an optimal strategy and calculating the electricity charge saved each year under the optimal strategy;
the second calculation module is used for taking the electric charge saved each year as net cash flow and calculating the internal rate of return on investment in different battery capacities;
the third calculation module is used for calculating the financial cost in the return on investment period under different battery capacities within an acceptable internal rate of return according to the loan interest rate;
the fourth calculation module is used for calculating profits under different battery capacities according to the initial investment and the financial cost;
and the scheme selection module is used for outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
Further, the system further comprises a training module for obtaining the strategy network by training with the Q-Learning algorithm of reinforcement learning, wherein the training module specifically comprises:
the initialization unit is used for initializing algorithm parameters and a Q table;
the inquiry unit is used for inputting the environment state and inquiring all actions which can be taken by the battery;
the selection unit is used for inquiring the Q value of each action which can be taken by the battery in the current state in the Q table and selecting the action according to the inquired Q value;
a calculation unit for calculating a reward according to the selected action;
a first updating unit for updating the electricity price in the battery;
a second updating unit for updating the Q table;
the third updating unit is used for updating the battery energy storage state and the environment state;
and the calling unit is used for repeatedly calling the inquiry unit, the selection unit, the calculation unit, the first updating unit, the second updating unit and the third updating unit until the algorithm converges.
Further, the selecting unit is specifically configured to:
if the Q values are all 0, giving the same probability to all actions which can be taken, and then selecting the action according to the probability; otherwise, the probability of the action with the highest Q value is increased, and then the action is selected according to the probability.
According to a third aspect of the embodiments of the present invention, there is provided a terminal device, including:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
According to a fourth aspect of embodiments of the present invention, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as described above.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
1. The decision on the energy storage scheme can be assisted step by step.
2. Based on historical data of the building demand side, more accurate calculation results can be given for each project to support investment decision.
3. Various conditions (such as power price volatility, unit battery cost, battery model selection, strategy constraint conditions and the like) can be flexibly changed, and accurate calculation results of corresponding changes can be given.
4. Based on feature settings for electricity price fluctuation (a manually set range), the calculation uncertainty and the return-on-investment risk can be given.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 is a schematic flow diagram illustrating a method for estimating battery energy storage capacity based on reinforcement learning algorithm according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of the algorithm of the method of the present invention;
FIG. 3 is a flow chart illustrating the training steps of the Q-Learning algorithm;
FIG. 4 is a graph showing the calculated annual electricity fee savings and the corresponding savings percentages for different battery capacities;
FIG. 5 is a graph illustrating the results of internal rate of return calculations for different battery capacities;
FIG. 6 is a schematic diagram of the 10-year revenue and expenditure for different battery capacities;
fig. 7 is a graph of the total 10-year profit for different battery capacities.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this disclosure and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that, although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The invention relates to an energy storage operation adjustment method that combines a building's year-round real-time operation data with real-time electricity prices. The energy storage scheme decision is made in two steps: the internal rate of return is calculated first, and then profit charts of the different schemes are given based on the acceptable internal rate of return, leaving the final decision to the user; the optimal strategy of the energy storage battery, taking long-term return into account, is calculated by a reinforcement learning algorithm.
The technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in fig. 2, the method comprises four links: basic information entry, energy storage battery strategy, profit calculation, and decision output.
Firstly, before making a decision on a battery energy storage scheme, two types of basic information need to be entered: first, the basic condition settings, including the battery specifications (theoretical charge-discharge depth, theoretical charge-discharge cycle efficiency, theoretical charge-discharge power, etc.) and the time-of-use electricity price; and second, the selection range of the energy storage battery, mainly the upper and lower limits of the battery capacity.
In the energy storage battery strategy link, a reinforcement learning algorithm is adopted; in this embodiment it is the Q-Learning algorithm, which treats the energy storage battery as an agent performing autonomous learning in an unknown environment. The agent starts from an initial state, takes an action, and receives a reward (positive or negative) after each action completes, while moving to a new state and taking a new action. Through continuous cycles, the agent learns rules that let it select the optimal action under different conditions. The core of the Q-Learning algorithm is to establish a state-action value table (Q table): the agent continuously interacts with the environment and updates the Q table, finally obtaining a Q table that records the rewards of different actions in different states; in a new state, the Q table is queried to find the action with the highest reward.
When training the reinforcement learning algorithm, the data must be preprocessed. Data such as historical electricity consumption and battery capacity are continuous, so there are too many states to store with a Q-table approach; for this stochastic optimal control problem with continuous state and decision spaces, the state and decision spaces therefore have to be discretized in advance. The historical electricity consumption, battery capacity, and electricity price data are discretized and binned according to corresponding step lengths, as sketched below.
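A minimal binning sketch in Python follows; the step lengths, variable names, and example values are illustrative assumptions, since the patent does not specify them:

```python
import numpy as np

# Hypothetical bin widths; the patent does not specify the step lengths.
LOAD_STEP_KWH = 10.0    # historical electricity consumption
PRICE_STEP = 0.05       # electricity price (currency units per kWh)
SOC_STEP_KWH = 5.0      # energy stored in the battery

def discretize(value: float, step: float) -> int:
    """Map a continuous value to a bin index with a fixed step length."""
    return int(np.floor(value / step))

# Example: a (consumption, price, stored energy) observation becomes a
# discrete state tuple that can index the Q table.
state = (discretize(123.4, LOAD_STEP_KWH),
         discretize(0.62, PRICE_STEP),
         discretize(37.5, SOC_STEP_KWH))
print(state)  # (12, 12, 7)
```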
In addition, the following definitions are required by the algorithm:
Defining the agent: the energy storage batteries with different capacities are the agents, which can take different actions under different conditions.
Defining the environment: the time-resolved historical energy consumption of a building over a period of time is the environment, including the time-of-use electricity price and the limits of the energy storage battery (input/output power, upper and lower capacity limits), etc.
Defining an action: a decision made by the agent; in this model, charging the energy storage battery, discharging it, or keeping it in a standby state.
Defining the action space: the set of actions obtained by grouping the charging and discharging power of the energy storage battery in the model.
State: a summary of the current environment, comprising the electric quantity of the energy storage battery, the time-of-use electricity price, and the time-resolved electricity consumption of the building.
State space: the set of all possible states.
Strategy function: outputs the agent's next action in the corresponding state.
Reward: the electricity charge saved by the energy storage battery under the time-of-use electricity price.
Strategy network: the correspondence table between actions and rewards (in this embodiment, the trained Q table).
Specifically, as shown in fig. 3, the training step of the Q-Learning algorithm includes:
S1, initializing algorithm parameters and a Q table;
The algorithm parameters are initialized as follows: γ = 0.95 (the discount factor; the closer γ is to 1, the more sensitive the agent is to future rewards), α = 0.2 (the learning rate, determining how much of the current error is learned), and ε = 0.65 (the greediness, the strategy used in decision making; e.g., ε = 0.9 means that 90 percent of the time the action is selected according to the optimal value in the Q table and 10 percent of the time it is selected randomly). The Q table is initialized with all values set to 0.
S2, inputting an environment state, and inquiring all actions which can be taken by the battery;
Specifically, the environment state includes the energy storage battery's state of charge, the electricity price state, the project's available energy state, and so on, and the actions the battery can take are as follows (see the sketch after this list):
a. the battery can be charged if it is not full;
b. the battery can be discharged if it is not empty;
c. the battery can stay neither charging nor discharging in any state.
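A minimal sketch of this feasibility check, with an assumed integer action encoding (not specified in the patent):

```python
CHARGE, DISCHARGE, IDLE = 0, 1, 2  # hypothetical action encoding

def feasible_actions(soc_kwh: float, capacity_kwh: float) -> list:
    """Actions the battery can take in the current state:
    a. charge if not full, b. discharge if not empty, c. idle in any state."""
    actions = [IDLE]
    if soc_kwh < capacity_kwh:
        actions.append(CHARGE)
    if soc_kwh > 0.0:
        actions.append(DISCHARGE)
    return actions
```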
S3, inquiring the Q value of each action which can be taken by the battery in the current state in the Q table, and selecting the action according to the inquired Q value;
Specifically, if the Q values are all 0, the same probability is assigned to all actions that can be taken and an action is then selected according to that probability; otherwise the probability of the action with the highest Q value is increased and the action is selected accordingly. In the early learning stage the selected action is insensitive to the reward size, which increases the diversity of the training samples and updates the Q table effectively; later, as learning progresses, the action with the highest reward is selected with a greater probability.
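A sketch of this selection rule, assuming the Q table is stored as a dictionary keyed by (state, action) pairs (an implementation detail the patent leaves open) and reusing the action list from the feasibility sketch above:

```python
import random

def select_action(q_table, state, actions, epsilon=0.65):
    """If all Q values are 0, choose uniformly among the feasible actions;
    otherwise choose the highest-Q action with probability epsilon and a
    random action with probability 1 - epsilon."""
    q_values = [q_table.get((state, a), 0.0) for a in actions]
    if all(q == 0.0 for q in q_values) or random.random() > epsilon:
        return random.choice(actions)              # uniform exploration
    return actions[q_values.index(max(q_values))]  # greedy exploitation
```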
S4, calculating rewards according to the selected actions;
Specifically, the reward is calculated as follows. Discharging the battery: if the electricity price in the battery is higher than the grid price, a negative reward = energy × price difference is obtained; if it is not higher, a positive reward = energy × price difference is obtained. Charging the battery: if the electricity price in the battery is not more than the grid price, a positive reward = energy × price difference is obtained; otherwise there is no reward. Battery idle: no reward. A sketch follows.
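A sketch of this reward rule, reusing the action encoding from the sketches above; battery_price stands for the electricity price of the energy held in the battery (updated in step S5), and all names are illustrative:

```python
def compute_reward(action, energy_kwh, battery_price, grid_price):
    """Reward = energy x electricity price difference, per the rules above."""
    if action == DISCHARGE:
        # Negative when the stored energy cost more than the current grid
        # price, positive otherwise.
        return energy_kwh * (grid_price - battery_price)
    if action == CHARGE:
        # Positive only when the battery's price does not exceed the grid
        # price; otherwise no reward.
        if battery_price <= grid_price:
            return energy_kwh * (grid_price - battery_price)
        return 0.0
    return 0.0  # idle: no reward
```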
S5, updating the electricity price in the battery;
S6, updating the Q table;
Specifically, based on the state and the reward, the Q table is updated by the following formula:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right)
S7, updating the energy storage state and the environment state of the battery;
Specifically, the environment state at the next time step is obtained by the update.
Steps S2 to S7 are repeated so that the agent fully explores the environment, traverses the various possibilities in it, and continuously updates the Q table, gradually finding the optimal solution, i.e., selecting the correct action in each state so as to maximize the saved electricity charge; training stops after convergence. A condensed sketch is given below.
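Putting S1 to S7 together, a condensed training-loop sketch; the environment object env, its methods, and the episode structure are assumptions, while the parameter values follow the initialization given above and select_action comes from the earlier sketch:

```python
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.95, 0.2, 0.65   # S1: parameters as given above

def train(env, n_episodes=500):
    """Condensed Q-Learning loop over a simulated building/battery
    environment `env` (its interface is an assumption, not from the patent)."""
    q_table = defaultdict(float)           # S1: Q table initialised to 0
    for _ in range(n_episodes):
        state = env.reset()                # S2: input the environment state
        done = False
        while not done:
            actions = env.feasible_actions(state)                     # S2
            action = select_action(q_table, state, actions, EPSILON)  # S3
            # S4/S5/S7: the environment computes the reward, updates the
            # price of the energy in the battery, and advances the storage
            # and environment states.
            next_state, reward, done = env.step(action)
            best_next = max(q_table[(next_state, a)]
                            for a in env.feasible_actions(next_state))
            q_table[(state, action)] += ALPHA * (                     # S6
                reward + GAMMA * best_next - q_table[(state, action)])
            state = next_state
    return q_table
```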
Fig. 1 is a flowchart illustrating a battery energy storage capacity estimation method based on a reinforcement learning algorithm according to an exemplary embodiment of the present invention.
Referring to fig. 1, the method includes:
110. Simulating the charging and discharging actions of the energy storage battery after the energy storage battery is added, according to the strategy network under different battery capacities, to obtain an optimal strategy, and calculating the electricity charge saved each year under the optimal strategy;
Specifically, the strategy network can be obtained by training with the Q-Learning algorithm of reinforcement learning.
According to the method, the annual electricity fee savings for different battery capacities can be calculated, together with the corresponding savings percentages, as shown in fig. 4.
120. Taking the annual saved electric charge as net cash flow, and calculating the internal rate of return on investment under different battery capacities;
Specifically, the internal rate of return (IRR) is the discount rate at which the total present value of cash inflows equals the total present value of cash outflows, i.e., at which the net present value equals zero. It is the rate of return an investment is expected to achieve, and the larger the indicator, the better. Generally, a project is feasible when the internal rate of return is equal to or greater than the benchmark rate of return.
The IRR formula is:

NPV = \sum_{i=0}^{n} \frac{CF_i}{(1 + IRR)^i} = 0

where NPV is the net present value, CF_i is the net cash flow of the i-th year, n is the return-on-investment period, and IRR is the internal rate of return.
The return-on-investment period and the net cash flow of each year are entered into the internal-rate-of-return formula above, and the internal rates of return for different battery capacities are solved for, as shown in fig. 5.
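One way to solve the formula above for the IRR is a simple bisection on the NPV; the cash-flow numbers in the example are illustrative only:

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] is the initial investment (< 0)."""
    return sum(cf / (1.0 + rate) ** i for i, cf in enumerate(cash_flows))

def irr(cash_flows, lo=0.0, hi=1.0, tol=1e-6):
    """Bisection search for the rate at which NPV = 0; assumes a single
    sign change of the NPV over [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if npv(mid, cash_flows) > 0.0:
            lo = mid    # still profitable at this rate: discount harder
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: an initial investment of 100 and ten years of saved electricity
# fees of 15 per year give an IRR of roughly 8.1%.
print(irr([-100.0] + [15.0] * 10))
```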
130. Calculating financial costs in the return on investment period for different battery capacities within an acceptable internal rate of return according to loan rates;
Specifically, financial cost = principal + interest. Fig. 6 is a schematic diagram of the 10-year revenue and expenditure under different battery capacities.
140. Calculating profits under different battery capacities according to the initial investment and the financial cost;
Specifically, profit = total return - principal - interest. Fig. 7 is a schematic diagram of the 10-year total profit for different battery capacities. A sketch combining these formulas follows.
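A sketch combining the two formulas above; the simple-interest scheme and the example figures are assumptions, as the patent does not fix the loan terms:

```python
def financial_cost(principal, loan_rate, years):
    """Financial cost = principal + interest; simple interest is assumed
    here because the patent does not specify the interest scheme."""
    return principal * (1.0 + loan_rate * years)

def profit(total_savings, principal, loan_rate, years):
    """Profit = total saved electricity fees - principal - interest."""
    return total_savings - financial_cost(principal, loan_rate, years)

# Example: 100 borrowed at 5% over 10 years against 180 of total savings.
print(profit(180.0, 100.0, 0.05, 10))   # 30.0
```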
150. And outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
The invention provides a battery energy storage capacity estimation method based on a reinforcement learning algorithm, which has the following beneficial effects:
1. The decision on the energy storage scheme can be assisted step by step.
2. Based on historical data of the building demand side, more accurate calculation results can be given for each project to support investment decisions.
3. Various conditions (such as electricity price volatility, unit battery cost, battery model selection, strategy constraint conditions and the like) can be flexibly changed, and correspondingly changed accurate calculation results can be given.
4. Based on feature settings for electricity price fluctuation (a manually set range), the calculation uncertainty and the return-on-investment risk can be given.
The battery energy storage capacity estimation system based on the reinforcement learning algorithm, which is disclosed by the exemplary embodiment of the invention, comprises the following components:
the first calculation module is used for simulating the charging and discharging actions of the energy storage battery after the energy storage battery is added according to the strategy networks under different battery capacities to obtain an optimal strategy and calculating the electricity charge saved each year under the optimal strategy;
the second calculation module is used for taking the electricity charge saved each year as the net cash flow and calculating the internal rate of return over the return-on-investment period under different battery capacities;
the third calculation module is used for calculating the financial cost in the return on investment period under different battery capacities within an acceptable internal rate of return according to the loan interest rate;
the fourth calculation module is used for calculating profits under different battery capacities according to the initial investment and the financial cost;
and the scheme selection module is used for outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
Optionally, in this embodiment, the system further includes a training module configured to obtain the strategy network by training with the Q-Learning algorithm of reinforcement learning, where the training module specifically includes:
the initialization unit is used for initializing algorithm parameters and a Q table;
the inquiry unit is used for inputting the environment state and inquiring all actions which can be taken by the battery;
the selection unit is used for inquiring the Q value of each action which can be taken by the battery in the current state in the Q table and selecting the action according to the inquired Q value;
a calculation unit for calculating a reward according to the selected action;
a first updating unit for updating the electricity price in the battery;
a second updating unit for updating the Q table;
the third updating unit is used for updating the battery energy storage state and the environment state;
and the calling unit is used for repeatedly calling the inquiry unit, the selection unit, the calculation unit, the first updating unit, the second updating unit and the third updating unit until the algorithm converges.
Optionally, in this embodiment, the selecting unit is specifically configured to:
if the Q values are all 0, giving the same probability to all actions which can be taken, and then selecting the action according to the probability; otherwise, the probability of the action with the highest Q value is increased, and then the action is selected according to the probability.
With respect to the system in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.
An exemplary embodiment of the invention illustrates a computing device comprising a memory and a processor.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions that are required by the processor or other modules of the computer. The persistent storage device may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the persistent storage employs a mass storage device (e.g., magnetic or optical disk, flash memory). In other embodiments, the permanent storage may be a removable storage device (e.g., floppy disk, optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Further, the memory may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some embodiments, the memory may include a removable storage device that is readable and/or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disc, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory has stored thereon executable code which, when processed by the processor, causes the processor to perform some or all of the methods described above.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out part or all of the steps of the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the invention.
The aspects of the invention have been described in detail hereinabove with reference to the drawings. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required by the invention. In addition, it can be understood that the steps in the method according to the embodiment of the present invention may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device according to the embodiment of the present invention may be combined, divided, and deleted according to actual needs.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A battery energy storage capacity estimation method based on a reinforcement learning algorithm, characterized by comprising the following steps:
simulating, according to the strategy networks under different battery capacities, the charging and discharging actions of the energy storage battery after the energy storage battery is added, to obtain an optimal strategy, and calculating the electricity charge saved each year under the optimal strategy;
taking the annual saved electric charge as net cash flow, and calculating the internal rate of return on investment under different battery capacities;
calculating financial costs in the return on investment period for different battery capacities within an acceptable internal rate of return according to loan rates;
calculating profits under different battery capacities according to the initial investment and the financial cost;
and outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
2. The method of claim 1, further comprising the step of obtaining the strategy network by training with the Q-Learning algorithm of reinforcement learning, specifically comprising:
step 1, initializing algorithm parameters and a Q table;
step 2, inputting an environment state, and inquiring all actions which can be taken by the battery;
step 3, inquiring the Q value of each action which can be taken by the battery in the current state in the Q table, and selecting the action according to the inquired Q value;
step 4, calculating the reward according to the selected action;
step 5, updating the electricity price in the battery;
step 6, updating the Q table;
step 7, updating the energy storage state and the environment state of the battery;
and repeatedly executing the steps 2-7 until the algorithm converges.
3. The method according to claim 2, wherein in step 3, selecting an action according to the queried Q value specifically includes:
if the Q values are all 0, giving the same probability to all actions which can be taken, and then selecting the action according to the probability; otherwise, the probability of the action with the highest Q value is increased, and then the action is selected according to the probability.
4. The method according to claim 2, wherein the step 4 specifically comprises:
under the discharge state of the battery, if the electricity price in the battery is higher than the electricity price of the power grid, negative reward is obtained through calculation, and if the electricity price in the battery is not higher than the electricity price of the power grid, positive reward is obtained through calculation;
under the charging state of the battery, if the electricity price in the battery is not more than the electricity price of the power grid, positive reward is obtained through calculation, otherwise, reward is not carried out;
in a state where the battery is neither charged nor discharged, no prize is awarded.
5. The method according to claim 1, wherein the internal rate of return is calculated as:

NPV = \sum_{i=0}^{n} \frac{CF_i}{(1 + IRR)^i} = 0

wherein NPV is the net present value, CF_i is the net cash flow of the i-th year, n is the return-on-investment period, and IRR is the internal rate of return.
6. A battery energy storage capacity estimation system based on a reinforcement learning algorithm, characterized by comprising:
the first calculation module is used for simulating the charging and discharging actions of the energy storage battery after the energy storage battery is added according to the strategy networks under different battery capacities to obtain an optimal strategy and calculating the electricity charge saved each year under the optimal strategy;
the second calculation module is used for taking the electricity charge saved each year as the net cash flow and calculating the internal rate of return over the return-on-investment period under different battery capacities;
the third calculation module is used for calculating the financial cost in the return on investment period under different battery capacities within an acceptable internal rate of return according to the loan interest rate;
the fourth calculation module is used for calculating profits under different battery capacities according to the initial investment and the financial cost;
and the scheme selection module is used for outputting the battery capacity with the highest profit as a final battery capacity selection scheme.
7. The system according to claim 6, further comprising a training module for obtaining the strategy network by training with the Q-Learning algorithm of reinforcement learning, wherein the training module specifically comprises:
the initialization unit is used for initializing algorithm parameters and a Q table;
the inquiry unit is used for inputting the environment state and inquiring all actions which can be taken by the battery;
the selection unit is used for inquiring the Q value of each action which can be taken by the battery in the current state in the Q table and selecting the action according to the inquired Q value;
a calculation unit for calculating a reward according to the selected action;
a first updating unit for updating the electricity price in the battery;
a second updating unit for updating the Q table;
the third updating unit is used for updating the battery energy storage state and the environment state;
and the calling unit is used for repeatedly calling the inquiry unit, the selection unit, the calculation unit, the first updating unit, the second updating unit and the third updating unit until the algorithm converges.
8. The system according to claim 7, wherein the selection unit is specifically configured to:
if the Q values are all 0, giving the same probability to all actions which can be taken, and then selecting the action according to the probability; otherwise, the probability of the action with the highest Q value is increased, and then the action is selected according to the probability.
9. A terminal device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-5.
10. A non-transitory machine-readable storage medium having executable code stored thereon, wherein the executable code, when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-5.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210868577.9A | 2022-07-22 | 2022-07-22 | Battery energy storage capacity estimation method and system based on reinforcement learning algorithm |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210868577.9A | 2022-07-22 | 2022-07-22 | Battery energy storage capacity estimation method and system based on reinforcement learning algorithm |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115358528A | 2022-11-18 |

Family

ID=84031013

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210868577.9A | Battery energy storage capacity estimation method and system based on reinforcement learning algorithm | 2022-07-22 | 2022-07-22 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN115358528A (en) |

- 2022-07-22: Application CN202210868577.9A filed; published as CN115358528A (en); status Pending
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |