CN110598925A - Energy storage in-trading market decision optimization method based on double-Q learning algorithm - Google Patents


Info

Publication number
CN110598925A
Authority
CN
China
Prior art keywords
energy storage
decision
double
learning algorithm
market
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910832395.4A
Other languages
Chinese (zh)
Inventor
余运俊
蔡振奋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang University
Original Assignee
Nanchang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.) 2019-09-04
Filing date 2019-09-04
Publication date 2019-12-20
Application filed by Nanchang University
Priority to CN201910832395.4A
Publication of CN110598925A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q 30/0206 Price or cost determination based on market factors
    • G06Q 30/0207 Discounts or incentives, e.g. coupons or rebates
    • G06Q 30/0226 Incentive systems for frequent usage, e.g. frequent flyer miles programs or point systems
    • G06Q 30/0283 Price estimation or determination
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/06 Energy or water supply

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Software Systems (AREA)
  • Operations Research (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A decision optimization method for energy storage in trading markets based on a double-Q learning algorithm comprises the following steps: establishing a mathematical model for energy storage decisions in trading markets; describing the energy storage operation as a Markov decision process; performing iterative training on the two price data sets, using real historical market trading price data and the Double-Q learning algorithm, to obtain a trained Q table; and having the energy storage execute the actions in the trained Q table that maximize the decision objective, obtaining the cumulative reward under joint arbitrage. The Double-Q learning algorithm updates the Q table iteratively with two value functions, which reduces the overestimation bias of the Q-learning algorithm, makes the designed arbitrage strategy more stable, and yields higher long-term arbitrage returns for the energy storage. The arbitrage source is not limited to the electricity market; the carbon market is added, so the arbitrage revenue increases significantly.

Description

Energy storage in-trading market decision optimization method based on double-Q learning algorithm
Technical Field
The invention belongs to the technical field of engineering.
Background
With the increasing penetration of renewable resources, and given the high uncertainty of wind and solar generation, balancing supply and demand efficiently has become important. An energy storage system can continuously absorb energy and release it when needed, so it can meet users' demand for electricity, relieve overload on the grid, optimize the configuration of the power system, help maintain stable grid operation, serve different users' power requirements, and complement variable renewable sources; its economic feasibility is therefore receiving increasing attention. One of the most commonly discussed revenue sources for energy storage is real-time price arbitrage: the storage exploits price differences in the real-time electricity market, charging when prices are low and discharging when prices are high to earn a profit.
Because the growing share of intermittent renewable generation causes large price fluctuations in the real-time electricity market, decision-making for energy storage in trading markets has attracted considerable attention from the research community. However, even when price spreads widen, designing a good strategy that captures significant profit is not straightforward. The most obvious approach is to forecast prices, but forecast accuracy is hard to guarantee. Approximate dynamic programming was later used to derive bidding strategies for energy storage without prior knowledge of the price distribution, but such strategies tend to be computationally expensive because of the high dimensionality of the state space. Reinforcement learning is an online learning technique distinct from supervised and unsupervised learning, and the Q-learning algorithm in reinforcement learning provides an energy storage decision strategy within a data-driven framework.
Existing reinforcement-learning-based energy storage decision methods have the following shortcomings: decisions are made from electricity price information alone, so the source of decision information is limited; and the Q-learning algorithm suffers from overestimation, which makes its performance unstable.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a decision optimization method for energy storage in trading markets based on a double-Q learning algorithm.
The invention is realized by the following technical scheme.
The invention relates to a decision optimization method for energy storage in trading markets based on a double-Q learning algorithm, which comprises the following steps:
step 1: establishing a mathematical model for energy storage decisions in the trading markets;
step 2: describing the energy storage operation as a Markov decision process;
step 3: performing iterative training on the two price data sets, using real historical market trading price data and the Double-Q learning algorithm, to obtain a trained Q table;
step 4: the energy storage executes the actions in the trained Q table that maximize the decision objective, obtaining the cumulative reward under joint arbitrage.
Further, the step 1 comprises the following steps:
step 1-1: determining an objective function of an energy storage decision;
step 1-2: determining a stored electricity constraint of the energy storage system;
step 1-3: and determining charge and discharge power constraints of the energy storage system.
Further, the step 2 comprises the following steps:
step 2-1: setting the action of storing energy as a function of price;
step 2-2: determining an energy storage state space;
step 2-3: determining an energy storage action space;
step 2-4: an action reward function is determined.
Further, the step 3 includes the following steps:
step 3-1: determining the state of the energy storage system;
step 3-2: selecting an energy storage action according to an ε-greedy strategy;
step 3-3: randomly selecting one of the two update functions to update the Q-value table; the trained Q-value table is obtained after 3000 iterations.
Compared with the prior art, the invention has the following beneficial effects: (1) the Double-Q learning algorithm updates the Q table iteratively with two value functions, which reduces the overestimation bias of the Q-learning algorithm, makes the designed arbitrage strategy more stable, and yields higher long-term decision returns for the energy storage; (2) the price data are not limited to the electricity market; the carbon market is added, so the cumulative reward increases significantly.
Drawings
FIG. 1 is a block diagram of the decision-making method for energy storage in trading markets.
FIG. 2 is a block diagram of the Markov decision process.
FIG. 3 is a decision flow diagram of the Double-Q learning algorithm.
Detailed Description
The following description will be made with reference to the drawings.
The invention provides a decision optimization method for energy storage in trading markets based on a double-Q learning algorithm. The method uses the double-Q learning algorithm to make decisions in the electricity and carbon markets of a given region so as to maximize the cumulative reward. A flow chart of the method is shown in FIG. 1, and the method specifically comprises the following steps:
step 1: establishing a mathematical model for energy storage decisions in the electricity market and the carbon market;
step 2: describing the energy storage operation as a Markov decision process;
step 3: performing iterative training on the two price data sets, using real historical carbon price and electricity price data of a given market and the Double-Q learning algorithm, to obtain a trained Q table;
step 4: the energy storage executes the actions in the trained Q table that maximize the decision objective, obtaining the cumulative reward under the decisions of the method.
Further, the step 1 comprises the following steps:
step 1-1: determining an objective function of the energy storage combined arbitrage as follows:
step 1-2: determining a stored charge constraint of the energy storage system as:
step 1-3: determining charge and discharge power constraints of the energy storage system:
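The objective function and constraints referenced in steps 1-1 to 1-3 are not reproduced in this text (they appear as equation images in the original publication). Purely for illustration, and not as the patent's actual equations, a typical formulation of joint electricity/carbon arbitrage for a storage unit over a horizon T could be written as:

```latex
% Illustrative only (assumed, not the patent's equations).
% p_t: electricity price, c_t: carbon price, lambda: assumed emission factor,
% g_t / d_t: charging / discharging power, E_t: stored energy, eta_c / eta_d: efficiencies.
\max_{g_t,\, d_t}\ \sum_{t=1}^{T} \big(p_t + \lambda c_t\big)\big(d_t - g_t\big)\,\Delta t
\qquad \text{s.t.} \qquad
E_{t+1} = E_t + \Big(\eta_c\, g_t - \frac{d_t}{\eta_d}\Big)\Delta t,\quad
E_{\min} \le E_t \le E_{\max},\quad
0 \le g_t \le G_{\max},\quad 0 \le d_t \le D_{\max}.
```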
further, to describe the energy storage operation as a markov decision process:
step 2-1: the action of storing energy is set as a function of price:
step 2-2: determining an energy storage state space function;
S=(P,Q)*E (5)
step 2-3: determining the energy storage action-space function;
step 2-4: determining the action reward function. A minimal illustrative sketch of these elements is given below.
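As a concrete illustration of these MDP elements, the following sketch encodes a discretized state (electricity-price bin, carbon-price bin, stored-energy level), a three-valued action set (charge, idle, discharge) and a price-based reward. The discretization, parameter names and reward shape are assumptions made here for illustration; they are not the patent's functions (5)-(7).

```python
# Assumed discretization and storage parameters (illustrative only).
N_PRICE_BINS = 10        # bins for electricity and carbon prices
N_ENERGY_LEVELS = 5      # discrete stored-energy levels
POWER = 1.0              # energy moved per step when charging or discharging
ACTIONS = (-1, 0, 1)     # -1 = charge, 0 = idle, 1 = discharge

def discretize(value, low, high, n_bins):
    """Map a continuous value to an integer bin index in [0, n_bins - 1]."""
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def state(elec_price, carbon_price, energy, bounds, e_max):
    """State s = (electricity-price bin, carbon-price bin, energy level)."""
    p = discretize(elec_price, *bounds["elec"], N_PRICE_BINS)
    q = discretize(carbon_price, *bounds["carbon"], N_PRICE_BINS)
    e = discretize(energy, 0.0, e_max, N_ENERGY_LEVELS)
    return (p, q, e)

def reward(action, elec_price, carbon_price, energy, e_max):
    """Revenue when discharging, cost when charging; infeasible moves are penalized."""
    if action == 1 and energy < POWER:           # cannot discharge an empty store
        return -1.0
    if action == -1 and energy > e_max - POWER:  # cannot charge a full store
        return -1.0
    return action * (elec_price + carbon_price) * POWER
```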
Further, following the algorithm flowchart of FIG. 3, step 3 includes the following steps:
step 3-1: acquiring historical price data of electricity and carbon market trading for a given region and determining the state S of the energy storage system according to the state-space function (5); unlike the prior art, a second price signal is added to the decision state space, so that decisions are made over two price distributions;
step 3-2: calculating the reward values of all actions in each state according to the reward function (7), which are used to guide the selection of subsequent actions, as shown in Table 1:
TABLE 1
Reward value table   Action A1    Action A2
State S1             R(S1,A1)     R(S1,A2)
State S2             R(S2,A1)     R(S2,A2)
…                    …            …
State Sn             R(Sn,A1)     R(Sn,A2)
Step 3-3: following the ε-greedy strategy, the algorithm selects a random action from the action function (6) with probability ε ∈ (0, 1), and with probability (1 - ε) it selects the action with the largest value in the reward-value table, which keeps the iteration from settling into a local optimum (a sketch of this selection is given below);
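A minimal sketch of this ε-greedy selection is shown below; the default exploration rate and the list-based value lookup are assumptions for illustration.

```python
import random

def epsilon_greedy(values_row, actions=(-1, 0, 1), eps=0.1):
    """With probability eps pick a random action (explore);
    otherwise pick the action with the largest entry in values_row (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    best = max(range(len(actions)), key=lambda i: values_row[i])
    return actions[best]
```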
step 3-4: Q-learning is a model-free reinforcement learning technique that finds an optimal action-selection policy for an MDP. It learns an action-value function and can ultimately give the desired action based on the current state and the optimal policy. One advantage is that it does not require a model of the environment in order to compare the expected values of actions. However, the max operation in standard Q-learning uses the same values both to select and to evaluate an action, which makes it more likely to pick overestimated values and leads to over-optimistic value estimates. To avoid this, selection and evaluation are decoupled: at each state update, one of the functions (8) and (9) is chosen at random to update the Q-table values, which avoids overestimating action values. After 3000 iterations the Q table shown in Table 2 is obtained, and the action with the largest Q value in the table is executed to obtain the cumulative reward. A minimal sketch of this decoupled update is given after Table 2.
TABLE 2
Q value table   Action A1    Action A2
State S1        Q(S1,A1)     Q(S1,A2)
State S2        Q(S2,A1)     Q(S2,A2)
…               …            …
State Sn        Q(Sn,A1)     Q(Sn,A2)
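The sketch below illustrates the decoupled update described in step 3-4: two Q tables are kept, one is picked at random for each update, the greedy next action is selected with one table and evaluated with the other. This follows the standard double Q-learning rule; the learning rate, discount factor, table representation and function names are assumptions standing in for the patent's update functions (8) and (9).

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95      # assumed learning rate and discount factor
ACTIONS = (-1, 0, 1)          # charge, idle, discharge

# Two independent Q tables, keyed by (state, action).
Q_A = defaultdict(float)
Q_B = defaultdict(float)

def double_q_update(s, a, r, s_next):
    """Randomly update one table; select the greedy next action with one table
    and evaluate it with the other, which reduces overestimation."""
    if random.random() < 0.5:
        a_star = max(ACTIONS, key=lambda x: Q_A[(s_next, x)])   # select with A
        target = r + GAMMA * Q_B[(s_next, a_star)]              # evaluate with B
        Q_A[(s, a)] += ALPHA * (target - Q_A[(s, a)])
    else:
        b_star = max(ACTIONS, key=lambda x: Q_B[(s_next, x)])   # select with B
        target = r + GAMMA * Q_A[(s_next, b_star)]              # evaluate with A
        Q_B[(s, a)] += ALPHA * (target - Q_B[(s, a)])
```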
FIG. 2 is a block diagram of the Markov decision process, whose objective is to find an optimal policy, i.e., a sequence of actions that maximizes the value function. For the state S at each time step, the agent chooses an appropriate action according to the optimal policy. To maximize the cumulative reward of the decision objective, the energy storage operation is described as a Markov decision process and its state, action, policy and reward elements are defined. The charge/discharge decision is defined as a function of the price information, and a double-Q learning strategy is designed to optimally control the real-time decisions of the energy storage system in the electricity and carbon markets.
FIG. 3 is a flow chart of the double-Q learning algorithm, which, after the price data are fed into the state space for training, performs more stably than Q-learning and reduces overestimation. When applied to joint energy storage arbitrage, the method is first initialized; the current state is then determined and an energy storage action is selected with an ε-greedy policy: with probability ε ∈ (0, 1) a random action is chosen, and with probability (1 - ε) the best action is chosen. The key point is the two update functions used by double-Q learning, one of which determines the value produced by the action while the other updates the Q-value table. A compact training loop combining these elements is sketched below.
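Putting the pieces together, a compact training loop over historical electricity and carbon prices might look like the following sketch (3000 iterations, as in step 3-3). It reuses the illustrative state, reward, epsilon_greedy and double_q_update helpers assumed in the sketches above; price ranges and storage capacity are likewise assumed values, not the patent's data.

```python
def train(prices, episodes=3000, e_max=5.0):
    """prices: list of (electricity_price, carbon_price) tuples from historical data."""
    bounds = {"elec": (0.0, 100.0), "carbon": (0.0, 50.0)}  # assumed price ranges
    for _ in range(episodes):
        energy = 0.0
        for t in range(len(prices) - 1):
            p, c = prices[t]
            s = state(p, c, energy, bounds, e_max)
            q_row = [Q_A[(s, a)] + Q_B[(s, a)] for a in ACTIONS]  # act on combined tables
            a = epsilon_greedy(q_row, actions=ACTIONS)
            r = reward(a, p, c, energy, e_max)
            energy = min(max(energy - a * POWER, 0.0), e_max)     # a = -1 charges, a = 1 discharges
            s_next = state(*prices[t + 1], energy, bounds, e_max)
            double_q_update(s, a, r, s_next)
    return Q_A, Q_B
```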

Claims (4)

1. A decision optimization method for energy storage in trading markets based on a double-Q learning algorithm, characterized by comprising the following steps:
step 1: establishing a mathematical model for energy storage decisions in the trading markets;
step 2: describing the energy storage operation as a Markov decision process;
step 3: performing iterative training on the two price data sets, using real historical market trading price data and the Double-Q learning algorithm, to obtain a trained Q table;
step 4: the energy storage executes the actions in the trained Q table that maximize the decision objective, obtaining the cumulative reward under joint arbitrage.
2. The decision optimization method for energy storage in trading markets based on the double-Q learning algorithm as claimed in claim 1, wherein the step 1 comprises the following steps:
step 1-1: determining an objective function of an energy storage decision;
step 1-2: determining a stored electricity constraint of the energy storage system;
step 1-3: and determining charge and discharge power constraints of the energy storage system.
3. The decision optimization method for energy storage in trading markets based on the double-Q learning algorithm as claimed in claim 1, wherein the step 2 comprises the following steps:
step 2-1: setting the action of storing energy as a function of price;
step 2-2: determining an energy storage state space;
step 2-3: determining an energy storage action space;
step 2-4: an action reward function is determined.
4. The decision optimization method for energy storage in trading markets based on the double-Q learning algorithm as claimed in claim 1, wherein the step 3 comprises the following steps:
step 3-1: determining the state of the energy storage system;
step 3-2: selecting an energy storage action according to an ε-greedy strategy;
step 3-3: randomly selecting one of the two update functions to update the Q-value table; the trained Q-value table is obtained after 3000 iterations.
CN201910832395.4A 2019-09-04 2019-09-04 Energy storage in-trading market decision optimization method based on double-Q learning algorithm Pending CN110598925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910832395.4A CN110598925A (en) 2019-09-04 2019-09-04 Energy storage in-trading market decision optimization method based on double-Q learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910832395.4A CN110598925A (en) 2019-09-04 2019-09-04 Energy storage in-trading market decision optimization method based on double-Q learning algorithm

Publications (1)

Publication Number Publication Date
CN110598925A true CN110598925A (en) 2019-12-20

Family

ID=68857593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910832395.4A Pending CN110598925A (en) 2019-09-04 2019-09-04 Energy storage in-trading market decision optimization method based on double-Q learning algorithm

Country Status (1)

Country Link
CN (1) CN110598925A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529610A (en) * 2020-11-23 2021-03-19 天津大学 End-to-end electric energy trading market user decision method based on reinforcement learning


Similar Documents

Publication Publication Date Title
US11581740B2 (en) Method, system and storage medium for load dispatch optimization for residential microgrid
Qi et al. Optimal configuration of concentrating solar power in multienergy power systems with an improved variational autoencoder
Bakhtiari et al. Predicting the stochastic behavior of uncertainty sources in planning a stand-alone renewable energy-based microgrid using Metropolis–coupled Markov chain Monte Carlo simulation
CN113991742B (en) Distributed photovoltaic double-layer collaborative optimization investment decision-making method for power distribution network
CN114511132A (en) Photovoltaic output short-term prediction method and prediction system
CN112994092B (en) Independent wind-solar storage micro-grid system size planning method based on power prediction
CN117077960A (en) Day-ahead scheduling optimization method for regional comprehensive energy system
Lin et al. Long-term multi-objective optimal scheduling for large cascaded hydro-wind-photovoltaic complementary systems considering short-term peak-shaving demands
Gao et al. A hybrid improved whale optimization algorithm with support vector machine for short-term photovoltaic power prediction
CN115600757A (en) Coordination optimization method and system for offshore wind power sharing energy storage participation spot market trading
CN116581792A (en) Wind-solar energy storage system capacity planning method based on data model driving
Bødal et al. Capacity expansion planning with stochastic rolling horizon dispatch
Luo et al. A cascaded deep learning framework for photovoltaic power forecasting with multi-fidelity inputs
CN108694475B (en) Short-time-scale photovoltaic cell power generation capacity prediction method based on hybrid model
Liu et al. A novel electricity load forecasting based on probabilistic least absolute shrinkage and selection operator-Quantile regression neural network
CN118432039A (en) New energy in-situ digestion capability assessment method considering distributed shared energy storage
CN110598925A (en) Energy storage in-trading market decision optimization method based on double-Q learning algorithm
Huang et al. Power prediction method of distributed photovoltaic digital twin system based on GA-BP
Zhan et al. Comparing model predictive control and reinforcement learning for the optimal operation of building-PV-battery systems
CN116128211A (en) Wind-light-water combined short-term optimization scheduling method based on wind-light uncertainty prediction scene
CN115456286A (en) Short-term photovoltaic power prediction method
Wang et al. Gradient boosting dendritic network for ultra-short-term PV power prediction
Dong et al. Design and optimal scheduling of forecasting-based campus multi-energy complementary energy system
Abdulla et al. Accounting for forecast uncertainty in the optimized operation of energy storage
CN112510678A (en) Time sequence simulation-based power selling company distributed power capacity configuration method

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20191220)