CN111461803A - Method and system for selecting bidding strategy for cross-country power market price reinforcement learning - Google Patents

Method and system for selecting bidding strategy for cross-country power market price reinforcement learning

Info

Publication number
CN111461803A
CN111461803A (application CN201910048373.9A)
Authority
CN
China
Prior art keywords
bidding strategy
bidding
strategy
reinforcement learning
algorithm model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910048373.9A
Other languages
Chinese (zh)
Inventor
李俊辉
白小保
周海明
张志峰
茹海波
张帅
郑磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201910048373.9A
Publication of CN111461803A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/06: Buying, selling or leasing transactions
    • G06Q30/0601: Electronic shopping [e-shopping]
    • G06Q30/0611: Request for offers or quotes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06: Energy or water supply


Abstract

The invention provides a method and a system, based on reinforcement learning, for selecting a bidding strategy for price quotation in a cross-country power market. The method acquires a bidding strategy set; substitutes the set into a pre-established reinforcement-learning RE (Roth-Erev) algorithm model and calculates, in wheel-disc (roulette-wheel) mode, the behavior tendency corresponding to the selected bidding strategy; iteratively calculates a probability selection function for each bidding strategy in the set, according to the behavior tendency corresponding to the strategy selected by the power trading operator, until a convergence condition is met; and selects a bidding strategy based on the probability selection function that satisfies the convergence condition.

Description

Method and system for selecting bidding strategy for cross-country power market price reinforcement learning
Technical Field
The invention relates to a method and a system, in particular to a method and a system for selecting a bidding strategy for cross-country electric power market price reinforcement learning.
Background
In the global energy internet, market union is an important means of promoting transnational electric power trading, and it occurs between countries and between regions. Within a global electric power market union, however, the decision process among multiple electric power market operators and the interaction process among multiple electric power suppliers are complex dynamic problems that are difficult to analyze and calculate with traditional analytical methods; this is especially pronounced in medium- and long-term electric power market trading.
At present, two methods are mainly used to solve transnational power market transactions. The first is based on traditional optimization theory: it applies a multi-level architecture, takes the production-benefit optimization problem of the power generator as its core, and realizes power market transaction optimization through the optimal power flow of a transcontinental backbone power network. The second is based on stochastic optimization: using a Monte Carlo method, it plays out a trading game under incomplete information, starting from the operator's optimal quotation, so that the game result reaches a Nash equilibrium.
However, owing to the particularity of the power market, power market transactions are constrained by multiple parties. Even under the assumptions of complete information and single-period trading, the existence and uniqueness of a Nash equilibrium is a widely recognized difficulty. Moreover, the global energy internet presents a complex market-union model: the operator's optimal declared price must be achieved over multiple trading periods under incomplete information, and the generator's optimal production benefit is difficult to solve from an analytical mathematical model.
With the development of artificial intelligence technology, reinforcement learning has emerged as an effective computational method for optimal-strategy problems. Reinforcement learning is a machine learning method based on the conditioned-reflex principle of animal learning; a reinforcement learning system mainly comprises an environment and agents. Common reinforcement learning algorithms include the Q-learning method and the Roth-Erev (RE) method, among others; the basic framework is shown in Fig. 2.
The Agent comprises three modules: an input module I, a reinforcement module R, and a policy module P. The input module I converts the state describing the environment into a form the Agent can accept and provides the input X to the policy module. The reinforcement module assigns each state of the environment a value r; this reinforcement signal can be obtained directly or indirectly from the state of the environment and is closely related to the subjective goal. The policy module P is the most critical module: its main function is to update the Agent's knowledge through a learning mechanism while enabling the Agent to select an action according to a certain policy and act on the environment.
In the power-union scenario of a cross-country power market, this learning-mechanism model exhibits two problems. First, if a strategy's action produces a very large negative profit and the corresponding behavior tendency becomes negative, the selection probability may become negative, which violates the definition of a probability. Second, if the profit is 0, the behavior tendency of every behavior strategy shrinks by the same proportion, so the selection probability of each strategy remains unchanged and learning stalls.
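These two failure modes of the proportional RE selection rule can be checked numerically. The sketch below is purely illustrative; the function name and the example propensity values are hypothetical, not from the patent:

```python
import math

def proportional_probs(propensities):
    """Classical proportional RE selection: p_m = q_m / sum(q_j)."""
    total = sum(propensities)
    return [q / total for q in propensities]

# Problem 1: a large negative profit can drive a propensity negative,
# which makes the "probability" negative -- not a valid distribution.
q = [5.0, -2.0, 4.0]
p = proportional_probs(q)
assert any(x < 0 for x in p)

# Problem 2: with zero profit every propensity shrinks by the same
# factor (1 - r), and the proportional rule is scale-invariant, so the
# selection probabilities never change and learning stalls.
r = 0.1
q2 = [3.0, 6.0, 1.0]
q2_next = [(1 - r) * x for x in q2]
assert all(math.isclose(a, b)
           for a, b in zip(proportional_probs(q2), proportional_probs(q2_next)))
```

The second assertion is exactly the "learning stops" condition the patent describes: the update changes every propensity, yet the induced distribution is unchanged.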
Disclosure of Invention
To solve these problems, the invention provides a method and a system for selecting a bidding strategy, via reinforcement learning, for a cross-country power market. The RE reinforcement learning algorithm is optimized and applied to a cross-country power-market-union scenario, so that the optimal overall price and the optimal production benefit of power generators across all power trading market unions are realized over multiple trading periods under incomplete information.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
a reinforcement learning cross-country power market bid strategy selection method, the method comprising:
acquiring a bidding strategy set;
substituting the bidding strategy set into a pre-established reinforcement learning RE algorithm model, and calculating the behavior tendency corresponding to the selected bidding strategy in a wheel disc mode;
iteratively calculating a probability selection function of each bidding strategy in the bidding strategy set according to the behavior tendency corresponding to the bidding strategy selected by the power transaction operator until a convergence condition is met;
and selecting a bidding strategy based on the probability selection function meeting the convergence condition.
Preferably, the building of the reinforcement learning RE algorithm model includes:
determining a response function of the reinforcement learning RE algorithm model based on the competitive bidding income of the power transaction operator in the current round;
and obtaining the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
Further, the response function in the reinforcement learning RE algorithm model is determined by the following formula:

R_im(D) = profit_ik(D)·(1 - e),   if m = k
R_im(D) = q_im(D)·e/(M - 1),      if m ≠ k

where R_im(D) is the response function of the reinforcement learning RE algorithm model, profit_ik(D) is the bidding income of the power trading operator in round D when strategy k is selected, D is the current round number, k is the index of the selected bidding strategy, M is the total number of bidding strategies, and e is the experience parameter.
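As a minimal sketch, the response function above might be written as follows. The function name, argument names, and the reading of M as the number of candidate strategies are assumptions of this sketch, not definitions from the patent:

```python
def response(profit_k, q_m, m, k, e, M):
    """R_im(D): the played strategy k is reinforced with (1 - e) of the
    round profit; every other strategy m receives a spillover of
    e / (M - 1) applied to its own current behavior tendency q_m."""
    if m == k:
        return profit_k * (1 - e)
    return q_m * e / (M - 1)
```

With e = 0.2 and M = 3, a played strategy earning profit 10 is reinforced by 8.0, while an unplayed strategy with tendency 2.0 receives 0.2.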
Further, obtaining the bidding revenue of the electricity trading operator in the current round comprises:
respectively generating quotations based on each bidding strategy in the bidding strategy set;
and determining the bidding income of the power transaction operator in the current turn based on the clearing information and the bidding strategy corresponding to the quoted price.
Further, generating a quotation based on each bidding strategy in the bidding strategy set comprises:
initializing the power trading operator's bidding strategy set

A_i = {a_1, a_2, …, a_M},

the initial function c_i(q_Gi), the initial behavior tendency q_im(0), the initial selection probability p_im(0), the constraint conditions, and the price, where i denotes the i-th power trading operator;

the power trading operator selects a bidding strategy a_m ∈ A_i and generates the corresponding quotation f_i(q_Gi) = c_i(q_Gi);

wherein the initial behavior tendency q_im(0) = q_i(0), the initial selection probability p_im(0) is 1/M, and M is the total number of bidding strategies.
Further, determining the bidding income of the power trading operator in the current round based on the clearing information and the bidding strategy corresponding to the quotation comprises:

after all operators submit quotations, clearing information is formulated according to predefined clearing rules and fed back to the power trading operator, which forwards it to the power generators;

the power trading operator obtains the bidding income of the current round according to the clearing information and the selected bidding strategy, wherein the clearing information comprises the clearing price and the awarded (winning-bid) electricity quantity.
Preferably, the behavior tendency corresponding to the selected bidding strategy is determined by the following formula:

q_im(D+1) = (1 - r)·q_im(D) + R_im(D)

where q_im(D) is the behavior tendency of selecting bidding strategy a_m in round D, q_im(D+1) is the behavior tendency of selecting bidding strategy a_m in round D+1, r is the forgetting factor, and R_im(D) is the response function of the reinforcement learning RE algorithm model.
Further, the probability selection function of each bidding strategy in the bidding strategy set is determined according to the following formula:

p_im(D) = exp(q_im(D)/c) / Σ_{j=1..M} exp(q_ij(D)/c)

where p_im(D) is the probability selection function of the power trading operator selecting bidding strategy a_m in round D, c is the cooling coefficient, q_ij(D) is the behavior tendency corresponding to the operator's j-th bidding strategy in round D, and M is the total number of bidding strategies.
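A minimal Python sketch of this Gibbs-form probability selection function follows. The function name is hypothetical, and the subtraction of the maximum propensity before exponentiating is a standard numerical-stability trick, not part of the patent formula:

```python
import math

def gibbs_selection_probs(propensities, c):
    """Gibbs (Boltzmann) selection with cooling coefficient c.
    Exponentiation keeps every probability strictly positive even when a
    behavior tendency is negative, and uniform shrinkage of all
    propensities still changes the distribution, so learning does not
    stall -- addressing both problems of the proportional rule."""
    mx = max(propensities)  # shift for numerical stability; probabilities unchanged
    weights = [math.exp((q - mx) / c) for q in propensities]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with propensities [5.0, -2.0, 4.0] and c = 1.0, all three probabilities are positive and sum to 1, unlike the proportional rule on the same inputs.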
A reinforcement learning cross-country power market bid strategy selection system, the system comprising:
the obtaining module is used for obtaining a bidding strategy set;
the determining module is used for substituting the bidding strategy set into a pre-established reinforcement learning RE algorithm model and calculating the behavior tendency corresponding to the selected bidding strategy in a wheel disc mode;
the iterative computation module is used for iteratively computing a probability selection function of each bidding strategy in the bidding strategy set according to the behavior tendency corresponding to the bidding strategy selected by the power transaction operator until a convergence condition is met;
and the selection module is used for selecting the bidding strategy based on the probability selection function meeting the convergence condition.
Preferably, the determining module includes:
the determining unit is used for determining a response function of the reinforcement learning RE algorithm model based on the competitive bidding income of the power transaction operator in the current turn;
and the obtaining unit is used for obtaining the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
Compared with the closest prior art, the technical scheme provided by the invention has the following beneficial effects:
the invention provides a method and a system for selecting a bidding strategy for strengthening learning cross-country power market quotation, which can be applied to a cross-country power market power combination scene, and a bidding strategy set is obtained; substituting the bidding strategy set into a pre-established reinforcement learning RE algorithm model, and calculating the behavior tendency corresponding to the selected bidding strategy in a wheel disc mode; the problem of negative value behavior tendency and learning interruption of a reinforcement learning RE general algorithm model is solved, clear price selection in a cross-country electric power market electric power combined scene is stable, and powerful technical support can be provided for price strategies of operators.
The probability selection function of each bidding strategy in the set is iteratively calculated, according to the behavior tendency corresponding to the strategy selected by the power trading operator, until a convergence condition is met, and a bidding strategy is selected based on the converged probability selection function. Iterating the probability selection function to convergence improves the accuracy of strategy selection and brings the selected result closer to the actual situation.
Drawings
FIG. 1 is a flow chart of the reinforcement-learning-based cross-country power market bidding strategy selection method provided in an embodiment of the present invention;
FIG. 2 is a basic framework diagram of a reinforcement learning system provided in the background of the invention;
fig. 3 is a flowchart of a cross-country operator electricity trading market quotation algorithm provided in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The method mainly optimizes the RE reinforcement learning algorithm and applies the algorithm to a cross-country power market union scene, so that the optimal overall price and the optimal production benefit of a power generator in all power market unions are realized under the condition of multiple trading periods and incomplete information.
Introduction of basic principle:
suppose that the optional policy set of the electric power trade operator Agent i of each country is A (a)1,a2,…,ai,am) If the game is repeated, the game is played in the D-th round strategy akIs selected and Age is calculatednti the benefit of this round is profitik(D) Then at round D +1, for strategy amThe trend update formula of (1) is as follows:
q_im(D+1) = (1 - r)·q_im(D) + R_im(D)    (2-1)

where the response function is

R_im(D) = profit_ik(D)·(1 - e),   if m = k
R_im(D) = q_im(D)·e/(M - 1),      if m ≠ k    (2-2)
In these formulas, r is a forgetting factor: it suppresses the unbounded growth of each behavior tendency over time, reduces the weight of past experience, and strengthens the influence of new strategies. e is an experience parameter that encourages the Agent to generate diverse quotation strategies in the early stage of the repeated game.
At this point, the selection probability formula for strategy a_m is as follows:

p_im(D) = q_im(D) / Σ_{j=1..M} q_ij(D)    (2-3)
and the Agent i selects the next round of strategy behaviors according to the new selection probability and a wheel disc mode.
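The wheel-disc (roulette-wheel) draw mentioned above can be sketched as follows; the helper name is hypothetical, and the probabilities are assumed to sum to 1:

```python
import random

def roulette_select(probs, rng=random):
    """Roulette-wheel selection: draw u ~ U(0, 1) and walk the cumulative
    distribution until it first exceeds u, returning that strategy index."""
    u = rng.random()
    cum = 0.0
    for idx, p in enumerate(probs):
        cum += p
        if u < cum:
            return idx
    return len(probs) - 1  # guard against floating-point shortfall
```

A strategy with probability 1 is always drawn, and strategies with probability 0 are never drawn, as expected of the wheel.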
Each behavior i in the feasible domain is marked with a tendency coefficient q_i and a probability coefficient p_i, and both are updated according to the income of each bidding round. With suitably adapted coefficients, a convergence state can be reached, i.e., the probability p_r that some behavior r is selected approaches 1. This means that once the agent reaches the converged state, the quotation always performs the same action r in the feasible domain. Each generator-set agent uses the RE reinforcement learning algorithm to make the price decision in each round of trading: starting from equal selection probabilities for the various quotation strategies, it searches for the best declared price over repeated auctions so as to maximize profit, updates its quotation according to the learning algorithm, and repeats this process until final equilibrium is reached.
In the power-union scenario of a transnational power market, the RE algorithm model has the following two problems. First, if a strategy's action produces a very large negative profit and the behavior tendency becomes negative, the selection probability becomes negative, which does not accord with the definition of a probability. Second, if profit_ik(D) is 0, the behavior tendency of every behavior strategy shrinks by the same proportion, so the selection probability of each strategy remains unchanged and learning stops. To solve these problems, a method for selecting a bidding strategy, via reinforcement learning, for cross-country power market price quotation is provided. As shown in Figs. 1 and 3, the specific operation steps are as follows:
S1: acquire a bidding strategy set;
S2: substitute the bidding strategy set into a pre-established reinforcement learning RE algorithm model and calculate, in wheel-disc (roulette-wheel) mode, the behavior tendency corresponding to the selected bidding strategy;
S3: according to the behavior tendency corresponding to the bidding strategy selected by the power trading operator, iteratively calculate the probability selection function of each bidding strategy in the set until the convergence condition is satisfied;
S4: select a bidding strategy based on the probability selection function satisfying the convergence condition.
In step S1, the building of the reinforcement learning RE algorithm model includes:
determining a response function of the reinforcement learning RE algorithm model based on the competitive bidding income of the power transaction operator in the current round;
and obtaining the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
Wherein, the response function in the reinforcement learning RE algorithm model is determined by the following formula:

R_im(D) = profit_ik(D)·(1 - e),   if m = k
R_im(D) = q_im(D)·e/(M - 1),      if m ≠ k

where R_im(D) is the response function of the reinforcement learning RE algorithm model, profit_ik(D) is the bidding income of the power trading operator in round D when strategy k is selected, D is the current round number, k is the index of the selected bidding strategy, M is the total number of bidding strategies, and e is the experience parameter.
Wherein obtaining the competitive bidding revenue of the electricity trading operator in the current round comprises:
step a, respectively generating quotations based on each bidding strategy in a bidding strategy set;
and b, determining the bidding income of the power transaction operator in the current round based on the clearing information and the bidding strategy corresponding to the quoted price.
In step a, generating a quotation based on each bidding strategy in the bidding strategy set comprises:

initializing the power trading operator's bidding strategy set

A_i = {a_1, a_2, …, a_M},

the initial function c_i(q_Gi), the initial behavior tendency q_im(0), the initial selection probability p_im(0), the constraint conditions, and the price, where i denotes the i-th power trading operator;

the power trading operator selects a bidding strategy a_m ∈ A_i and generates the corresponding quotation f_i(q_Gi) = c_i(q_Gi);

wherein the initial behavior tendency q_im(0) = q_i(0), the initial selection probability p_im(0) is 1/M, and M is the total number of bidding strategies.
In step b, determining the bidding income of the power trading operator in the current round based on the clearing information and the bidding strategy corresponding to the quotation comprises:

after all operators submit quotations, clearing information is formulated according to predefined clearing rules and fed back to the power trading operator, which forwards it to the power generators;

the power trading operator obtains the bidding income of the current round according to the clearing information and the selected bidding strategy, wherein the clearing information comprises the clearing price and the awarded (winning-bid) electricity quantity.
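The round income can be sketched under the assumption that it equals the clearing price times the awarded electricity quantity minus the operator's cost for that quantity. This revenue formula and all names below are assumptions of the sketch; the patent only states that income is derived from the clearing information:

```python
def round_profit(clearing_price, awarded_quantity, cost_fn):
    """Hypothetical round income for an operator: revenue at the market
    clearing price for the awarded (winning-bid) quantity, minus the
    operator's cost c_i(q) of supplying that quantity."""
    return clearing_price * awarded_quantity - cost_fn(awarded_quantity)
```

For instance, at a clearing price of 50, an awarded quantity of 10, and a linear cost of 30 per unit, the round income would be 200 under this assumption.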
In step S2, the behavior tendency corresponding to the selected bidding strategy is determined according to the following formula:

q_im(D+1) = (1 - r)·q_im(D) + R_im(D)

where q_im(D) is the behavior tendency of selecting bidding strategy a_m in round D, q_im(D+1) is the behavior tendency of selecting bidding strategy a_m in round D+1, r is the forgetting factor, and R_im(D) is the response function of the reinforcement learning RE algorithm model.
In step S3, the probability selection function of each bidding strategy in the bidding strategy set is determined according to the following formula:

p_im(D) = exp(q_im(D)/c) / Σ_{j=1..M} exp(q_ij(D)/c)

where p_im(D) is the probability selection function of the power trading operator selecting bidding strategy a_m in round D, c is the cooling coefficient, q_ij(D) is the behavior tendency corresponding to the operator's j-th bidding strategy in round D, and M is the total number of bidding strategies.
Further, the convergence condition in step S3 is user-defined; when the convergence condition is not satisfied, the process returns to step S1, otherwise the process ends.
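Putting steps S1 to S4 together, a self-contained sketch of the modified RE loop might look as follows. This is an illustrative toy, not the patented system: the fixed per-strategy profits stand in for the market-clearing step, and every name and parameter value is an assumption of the sketch:

```python
import math
import random

def run_re_bidding(strategy_profits, rounds=500, r=0.1, e=0.2, c=5.0, seed=0):
    """Toy modified Roth-Erev loop: Gibbs selection probabilities with
    cooling coefficient c, roulette-wheel strategy draw, and the
    propensity update q(D+1) = (1 - r)*q(D) + R(D)."""
    rng = random.Random(seed)
    K = len(strategy_profits)
    q = [0.0] * K                      # initial behavior tendencies q_im(0)

    def probs():
        mx = max(q)                    # shift for numerical stability
        w = [math.exp((x - mx) / c) for x in q]
        s = sum(w)
        return [x / s for x in w]

    for _ in range(rounds):
        p = probs()
        # roulette-wheel draw of this round's bidding strategy
        u, cum, k = rng.random(), 0.0, K - 1
        for idx, pi in enumerate(p):
            cum += pi
            if u < cum:
                k = idx
                break
        profit = strategy_profits[k]   # stand-in for the clearing step
        for m in range(K):             # propensity update for every strategy
            R = profit * (1 - e) if m == k else q[m] * e / (K - 1)
            q[m] = (1 - r) * q[m] + R
    return probs()
```

With one clearly dominant strategy, the returned selection probabilities concentrate on it as the loop converges, mirroring the "probability of some behavior approaches 1" convergence state described above.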
From this specific embodiment, the key point of the invention lies in the choice of the response function and the probability selection function in the reinforcement learning RE algorithm model; accordingly, the invention protects the combined application of these two functions (and similar functions) together with the modified reinforcement learning RE algorithm model in the electric power market.
Example (b):
table 1 lists the results of the simulation in which AVE refers to the mean, SD refers to the standard deviation, and S% represents the percentage of the standard deviation relative to the mean. Experiments show that the average clearing price of the new algorithm is higher than that of the general algorithm, the value of the standard deviation calculated by the general algorithm is larger than that of the new algorithm, and the fluctuation of the clearing price of the new algorithm is reduced after modification, the clearing price is more accurate, the quotation of each operator is closer to the actual current situation, and the benefit of each power operator is favorably ensured.
Table 1: Average clearing price (k¥/MWh)
Based on the same inventive concept, the application also provides a system for selecting the bidding strategy for the cross-country power market through reinforcement learning, which comprises the following steps:
the obtaining module is used for obtaining a bidding strategy set;
the determining module is used for substituting the bidding strategy set into a pre-established reinforcement learning RE algorithm model and calculating the behavior tendency corresponding to the selected bidding strategy in a wheel disc mode;
the iterative computation module is used for iteratively computing a probability selection function of each bidding strategy in the bidding strategy set according to the behavior tendency corresponding to the bidding strategy selected by the power transaction operator until a convergence condition is met;
and the selection module is used for selecting the bidding strategy based on the probability selection function meeting the convergence condition.
Wherein, the determining module further comprises:
the determining unit is used for determining a response function of the reinforcement learning RE algorithm model based on the competitive bidding income of the power transaction operator in the current turn;
and the obtaining unit is used for obtaining the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for selecting a bidding strategy, via reinforcement learning, for cross-country power market quotation, characterized by comprising the following steps:
acquiring a bidding strategy set;
substituting the bidding strategy set into a pre-established reinforcement learning RE algorithm model, and calculating the behavior tendency corresponding to the selected bidding strategy in a wheel disc mode;
iteratively calculating a probability selection function of each bidding strategy in the bidding strategy set according to the behavior tendency corresponding to the bidding strategy selected by the power transaction operator until a convergence condition is met;
and selecting a bidding strategy based on the probability selection function meeting the convergence condition.
2. The method of claim 1, wherein the building of the reinforcement learning RE algorithm model comprises:
determining a response function of the reinforcement learning RE algorithm model based on the competitive bidding income of the power transaction operator in the current round;
and obtaining the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
3. The method of claim 2, wherein the response function in the reinforcement learning RE algorithm model is determined by the following equation;
R_im(D) = profit_ik(D)·(1 - e),   if m = k
R_im(D) = q_im(D)·e/(M - 1),      if m ≠ k

where R_im(D) is the response function of the reinforcement learning RE algorithm model, profit_ik(D) is the bidding income of the power trading operator in round D when strategy k is selected, D is the current round number, k is the index of the selected bidding strategy, M is the total number of bidding strategies, and e is the experience parameter.
4. The method of claim 3, wherein obtaining the bidding income of the power transaction operator in the current round comprises:
generating a quotation based on each bidding strategy in the bidding strategy set, respectively;
and determining the bidding income of the power transaction operator in the current round based on the clearing information and the bidding strategy corresponding to the quotation.
5. The method of claim 4, wherein the generating quotations based on each bidding strategy in the bidding strategy set respectively comprises:
initializing the bidding strategy set of the power transaction operator, S_i = {s_i1, s_i2, …, s_iM}, the initial function c_i(q_Gi), the initial behavior tendency q_im(0), the initial selection probability p_im(0), the constraint conditions and the price, where i denotes the i-th power transaction operator;
the power transaction operator selects a bidding strategy s_im ∈ S_i and generates the corresponding quotation f_i(q_Gi) = c_i(q_Gi);
wherein the initial behavior tendency q_im(0) = q_i(0), the initial selection probability p_im(0) is 1/M, and M is the total number of bidding strategies.
6. The method of claim 4, wherein determining the bidding income of the power transaction operator in the current round based on the clearing information and the bidding strategy corresponding to the quotation comprises:
after all operators have submitted their quotations, formulating clearing information according to predefined clearing rules, feeding the clearing information back to the power transaction operator, and sending the clearing information from the power transaction operator to the power generators;
the power transaction operator obtains the bidding income of the current round according to the clearing information and the selected bidding strategy;
wherein the clearing information comprises: the clearing price and the winning-bid electric quantity.
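Claim 6 names the inputs of the income calculation (the clearing price and the winning-bid electric quantity from the clearing information, plus the selected strategy) but not the formula itself. A plausible sketch is margin times awarded quantity; this formula, and the unit-cost input, are assumptions, not quoted from the claim.

```python
def bidding_income(clearing_price, awarded_quantity, unit_cost):
    """Hypothetical per-round bidding income:
    (clearing price - unit cost) x winning-bid electric quantity."""
    return (clearing_price - unit_cost) * awarded_quantity
```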
7. The method of claim 1, wherein the behavior tendency corresponding to the selected bidding strategy is determined by:
q_im(D+1) = (1 − r) · q_im(D) + R_im(D)
where q_im(D) is the behavior tendency of selecting bidding strategy m in round D, q_im(D+1) is the behavior tendency of selecting bidding strategy m in round D+1, r is the recency (forgetting) parameter, and R_im(D) is the response function of the reinforcement learning RE algorithm model.
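The update of claim 7 is a one-line recurrence: the old tendency is decayed and the response is added. The function and parameter names below are illustrative, and treating r as a recency parameter follows the standard Roth-Erev formulation.

```python
def update_tendency(q_m, R_m, r=0.1):
    """q_im(D+1) = (1 - r) * q_im(D) + R_im(D):
    decay the old behavior tendency, then add the round's response."""
    return (1 - r) * q_m + R_m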
8. The method of claim 7, wherein the probability selection function of each bidding strategy in the bidding strategy set is determined by:
p_im(D) = e^(q_im(D)/c) / Σ_{j=1}^{M} e^(q_ij(D)/c)
where p_im(D) is the probability that the power transaction operator selects bidding strategy m in round D, c is the cooling coefficient, q_ij(D) is the behavior tendency corresponding to the j-th bidding strategy in round D, M is the total number of bidding strategies, and e is the base of the natural logarithm.
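The probability selection function and the roulette-wheel draw it feeds can be sketched together. The Boltzmann form with cooling coefficient c is assumed from the claim's variable list; the names are illustrative.

```python
import math
import random

def selection_probabilities(q, c=2.0):
    """p_im(D) = exp(q_im/c) / sum_j exp(q_ij/c), over one operator's strategies."""
    w = [math.exp(qi / c) for qi in q]
    s = sum(w)
    return [wi / s for wi in w]

def roulette_pick(p, rng=random):
    """Roulette-wheel selection: walk the cumulative distribution past a
    uniform draw and return the index of the slice it lands in."""
    x, acc = rng.random(), 0.0
    for m, pm in enumerate(p):
        acc += pm
        if x < acc:
            return m
    return len(p) - 1                    # guard against rounding at the tail
```

A larger cooling coefficient c keeps the probabilities close to uniform (more exploration); a smaller c sharpens them toward the highest-tendency strategy.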
9. A system for reinforcement-learning-based bidding strategy selection in a cross-country power market, the system comprising:
an obtaining module, configured to obtain a bidding strategy set;
a determining module, configured to substitute the bidding strategy set into a pre-established reinforcement learning RE algorithm model and calculate, in a roulette-wheel manner, the behavior tendency corresponding to the selected bidding strategy;
an iterative computation module, configured to iteratively calculate a probability selection function for each bidding strategy in the bidding strategy set according to the behavior tendency corresponding to the bidding strategy selected by the power transaction operator, until a convergence condition is met;
and a selection module, configured to select a bidding strategy based on the probability selection function that meets the convergence condition.
10. The system of claim 9, wherein the determining module comprises:
a determining unit, configured to determine a response function of the reinforcement learning RE algorithm model based on the bidding income of the power transaction operator in the current round;
and an obtaining unit, configured to obtain the reinforcement learning RE algorithm model based on the response function of the reinforcement learning RE algorithm model.
CN201910048373.9A 2019-01-18 2019-01-18 Method and system for selecting bidding strategy for cross-country power market price reinforcement learning Pending CN111461803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910048373.9A CN111461803A (en) 2019-01-18 2019-01-18 Method and system for selecting bidding strategy for cross-country power market price reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910048373.9A CN111461803A (en) 2019-01-18 2019-01-18 Method and system for selecting bidding strategy for cross-country power market price reinforcement learning

Publications (1)

Publication Number Publication Date
CN111461803A true CN111461803A (en) 2020-07-28

Family

ID=71684914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910048373.9A Pending CN111461803A (en) 2019-01-18 2019-01-18 Method and system for selecting bidding strategy for cross-country power market price reinforcement learning

Country Status (1)

Country Link
CN (1) CN111461803A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348621A (en) * 2020-08-21 2021-02-09 国网吉林省电力有限公司 Generator quotation model based on RE Learning algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101916400A (en) * 2010-07-29 2010-12-15 中国电力科学研究院 ACE (Agent-based Computational Economics) simulation method of electricity market by adopting cooperative particle swarm algorithm
CN107644370A (en) * 2017-09-29 2018-01-30 中国电力科学研究院 Price competing method and system are brought in a kind of self-reinforcing study together

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feng Heng: "Simulation Method for the Behavior of Electricity Market Participants Based on Intelligent Agents", CNKI Outstanding Master's Theses Full-text Database, pages 10-71 *

Similar Documents

Publication Publication Date Title
Zhao et al. Jointly learning to recommend and advertise
Vytelingum The structure and behaviour of the continuous double auction
CN110796477A (en) Advertisement display method and device, electronic equipment and readable storage medium
Baranwal et al. A truthful and fair multi-attribute combinatorial reverse auction for resource procurement in cloud computing
CN111798280B (en) Multimedia information recommendation method, device and equipment and storage medium
CN109711871B (en) Potential customer determination method, device, server and readable storage medium
WO2019105235A1 (en) Pricing method and device, and computer-readable storage medium
Brânzei et al. Proportional dynamics in exchange economies
CN102163304A (en) Method and system for collaborative networking with optimized inter-domain information quality assessment
CN111192161A (en) Electric power market trading object recommendation method and device
Alcalde et al. Competition for procurement shares
WO2020104806A1 (en) Real-time bidding
Aggarwal et al. Multi-channel auction design in the autobidding world
CN107527128B (en) Resource parameter determination method and equipment for advertisement platform
KR20220017379A (en) E-bidding consulting system based on competitor prediction
CN111461803A (en) Method and system for selecting bidding strategy for cross-country power market price reinforcement learning
Bertsimas et al. Optimal bidding in online auctions
CN110555742A (en) Generation method and system for generator agent quotation
Boyer et al. Common-value auction versus posted-price selling: an agent-based model approach
Chandlekar et al. Multi-unit double auctions: equilibrium analysis and bidding strategy using DDPG in smart-grids
Agapitos et al. On the genetic programming of time-series predictors for supply chain management
US8321262B1 (en) Method and system for generating pricing recommendations
Mayer et al. Accounting for price dependencies in simultaneous sealed-bid auctions
US20050144081A1 (en) Method and system for predicting the outcome of an online auction
KR102451291B1 (en) E-bidding consulting system based on competitor prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination