CN112036936A - Deep Q network-based generator bidding behavior simulation method and system - Google Patents
- Publication number
- CN112036936A (application number CN202010836213.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- market
- bidding
- max
- value network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0206—Price or cost determination based on market factors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0207—Discounts or incentives, e.g. coupons or rebates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0611—Request for offers or quotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/06—Electricity, gas or water supply
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Primary Health Care (AREA)
- Tourism & Hospitality (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention discloses a deep Q network-based method and system for simulating the bidding behavior of power generators, wherein the method comprises the following steps: constructing a state space S, an action space A and a reward function; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; and the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Description
Technical Field
The invention relates to the technical field of power markets, in particular to a method and a system for simulating bidding behaviors of a power generator based on a deep Q network.
Background
Power market simulation techniques mainly comprise two approaches: experimental economics and agent-based computational economics. Experimental economics simulates the bidding behavior of generation participants in the real market through the decisions and performance of human testers in well-designed experiments; however, it is limited by the number of participants and their level of market knowledge, its results are highly random, and their relationship to the overall market remains to be demonstrated. The computational economics approach embeds intelligent agent models into a power market simulation research framework and makes bidding decisions using artificial intelligence methods. By comparison, the agent-based computational economics approach is therefore favored by researchers. The generator agent model is both the foundation and the difficulty of computational-economics-based market simulation: its output not only influences the clearing results of the market simulation, but its rationality also determines the rationality of the dynamic market simulation results.
At present, agent-based generator bidding simulation algorithms at home and abroad have achieved certain research results, but most of this work focuses on traditional Reinforcement Learning (RL) algorithms. Consider, for example, the following agent models: (1) a generator agent model built on a generative adversarial network (GAN), which mines bidding behavior from historical and simulated data but ignores the decision-making capability of an actual generator;
(2) a multi-input decision-factor agent model of the generator, which simulates the dynamic behavior evolution of generators under changing load demand; however, the VRE algorithm it adopts is a learning model for a one-dimensional environment, so the model's effectiveness in observing and perceiving the market environment remains to be verified;
(3) a generator bidding decision module based on the Q-learning algorithm, which has strong exploration capability; however, limited by the traditional Q-learning algorithm, the model can only handle discrete, low-dimensional market environments, and conditions such as line congestion and historical winning bids are not considered during decision making.
However, the actual power market is a complex system, and the bidding behavior of a generator is better characterized as a large-scale, continuous-space Markov Decision Process (MDP), so traditional RL algorithms struggle to effectively simulate the actual bidding behavior of generators.
Disclosure of Invention
The purpose of the invention is to provide a deep Q network-based method and system for simulating generator bidding behavior, which overcome the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improve the accuracy and rationality of the agent model's bidding behavior.
In order to achieve the purpose, the invention provides a generator bidding behavior simulation method based on a deep Q network, which comprises the following steps:
constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids;
the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
Further, the marginal cost calculation formula is as follows:
CM(P)=a+2bP
in the formula, a and b are coefficients of a first term and a second term of the cost function respectively; p is the output of the unit;
each action is to multiply the marginal cost by a factor, A ∈ [ A ]min,Amax]Divided into increasing H equal parts, AminAnd AmaxThe minimum and maximum selectable coefficients, respectively. If the agent model selects the ith action, the corresponding coefficients are:
Ai=Amin+i/H*(Amax-Amin)
then its quoted price is:
CB=CMAi。
Further, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
Further, the agent model submits its bid and the market operator performs the market clearing calculation based on the submitted bids, specifically: the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ); once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; based on the submitted bids, the market load, the grid topology and the market rules, and taking minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, the market operator computes the optimal power flow and gives the relevant market clearing information.
Further, the agent model training synchronization value network specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rt,φt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rj,φj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention also provides a deep Q network-based generator bidding behavior simulation system, which comprises: a construction unit for constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
an agent model processing unit that sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
a bid submission unit for the agent model to submit its bid, with the market operator performing the market clearing calculation based on the submitted bids;
a training unit for the agent model to train and synchronize the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
Further, the marginal cost calculation formula is as follows:
CM(P)=a+2bP
in the formula, a and b are coefficients of a first term and a second term of the cost function respectively; p is the output of the unit;
each action is to multiply the marginal cost by a factor, A ∈ [ A ]min,Amax]Divided into increasing H equal parts, AminAnd AmaxThe minimum and maximum selectable coefficients, respectively. If the agent model selects the ith action, the corresponding coefficients are:
Ai=Amin+i/H*(Amax-Amin)
then its quoted price is:
CB=CMAi。
Further, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
Further, the agent model training and synchronizing the value network specifically comprises: obtaining the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory; the agent model randomly samples a fixed number of transitions (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))²; the current value network weight parameter θ is updated by gradient descent on this error, and every t_step time steps the target value network weight is synchronized, θ⁻ = θ; if the agent model meets the end condition, the simulation ends and the final result is computed and output; if the agent model does not meet the end condition, it returns to the bid-submission step.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
Compared with the prior art, the deep Q network-based method and system for simulating generator bidding behavior of the invention have the following advantages:
The invention discloses a deep Q network-based method and system for simulating the bidding behavior of power generators, wherein the method comprises: constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Drawings
Fig. 1 is a schematic flow chart of a generator bidding behavior simulation method based on a deep Q network according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a market clearing process in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of a training and synchronization value network in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 4 is a schematic structural diagram of a generator bidding behavior simulation system based on a deep Q network according to a first embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
The first embodiment of the present invention:
referring to fig. 1 to fig. 3, a generator bidding behavior simulation method based on a deep Q network according to an embodiment of the present invention includes at least the following steps:
S101, constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
For step S101, when the state space S is constructed, the selection of environmental state features determines whether the agent model can effectively and correctly observe and perceive the market environment, which is an important basis for decision making. Therefore, the invention selects the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features.
For step S101, to facilitate simulating the monotonically increasing characteristic of a generator's bid curve when constructing the action space A, this patent constructs the action space of the agent model based on the marginal cost curve, and the marginal cost C_M(P) is calculated as follows:
C_M(P) = a + 2bP (1)
where a and b are the linear and quadratic coefficients of the unit's cost function, and P is the unit output.
Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], an interval divided into H equal parts, where A_min and A_max are the minimum and maximum selectable coefficients. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
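As an illustrative aid (not part of the patent), the action-to-bid mapping above can be sketched in Python; the cost coefficients a, b and the action-space parameters A_min, A_max, H below are assumed example values:

```python
def marginal_cost(P, a=20.0, b=0.005):
    """Marginal cost C_M(P) = a + 2bP of a unit with cost function a*P + b*P^2."""
    return a + 2 * b * P

def bid_price(i, P, A_min=1.0, A_max=2.0, H=10, a=20.0, b=0.005):
    """Bid C_B = C_M * A_i for the i-th action, A_i = A_min + (i/H)(A_max - A_min)."""
    A_i = A_min + i / H * (A_max - A_min)
    return marginal_cost(P, a, b) * A_i

# i = 0 bids exactly at marginal cost; i = H bids at A_max times marginal cost.
```

For P = 300 with these assumed coefficients, C_M = 23 and the selectable bids range from 23 (i = 0) to 46 (i = H).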
For step S101, the reward function (i.e., the reinforcement signal) relates to the objective the generator pursues when making bidding decisions, and can be obtained directly or indirectly from the market environment and market results, such as the winning bid price, the winning bid quantity, the generation revenue, or the generation profit. For simplicity, the generation profit is selected as the reward function of the agent model.
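A minimal sketch of this reward choice, assuming the quadratic cost function aP + bP² implied by marginal-cost formula (1) and illustrative coefficient values:

```python
def generation_profit(clearing_price, quantity, a=20.0, b=0.005):
    """Reward r_t: cleared revenue minus generation cost a*P + b*P^2 (assumed form)."""
    cost = a * quantity + b * quantity ** 2
    return clearing_price * quantity - cost
```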
S102, setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
The objects of the initialization process in step S102 are mainly the market environment state and the value network weight parameters. The market environment state sequence s_1 is initialized according to the state features selected for the state space S and the action space A, and φ_1 = φ(s_1) is obtained after max-min normalization preprocessing. The current value network weight parameter θ is initialized, and the target value network weight parameter is set to θ⁻ = θ.
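The max-min normalization preprocessing φ(s) can be sketched as follows; the three-feature example state and its bounds are hypothetical, standing in for the nodal price, highest winning-bid segment and tie-line congestion features:

```python
import numpy as np

def max_min_normalize(s, s_min, s_max):
    """phi(s): scale each state feature into [0, 1] using per-feature bounds."""
    s, s_min, s_max = (np.asarray(x, dtype=float) for x in (s, s_min, s_max))
    return (s - s_min) / (s_max - s_min)

s1 = [350.0, 4, 1]                                     # hypothetical raw state s_1
phi1 = max_min_normalize(s1, [200.0, 0, 0], [600.0, 8, 1])
```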
S103, the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids;
For step S103, the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ). Once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; the market operator performs the clearing calculation according to the bid strategies submitted by the agent models and the market rules, and publishes the clearing information. The market clearing calculation, based on the submitted bids, the market load, the grid topology and the market rules, takes minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, computes the optimal power flow, and gives the relevant market clearing information; the clearing model essentially solves the unit commitment problem and the calculation of locational marginal prices.
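The ε-greedy selection described above can be sketched as follows (illustrative only, with Q-values supplied as a plain list):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action a_t; otherwise argmax_a Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```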
S104, the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
For step S104, the agent model obtains the reward r_t and the next market environment state sequence s_{t+1} from the market clearing information. Meanwhile, φ_{t+1} = φ(s_{t+1}) is obtained by max-min normalization, and the transition (φ_t, a_t, r_t, φ_{t+1}) is stored in the replay memory. The agent model randomly samples a fixed number of transitions (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))². The current value network weight parameter θ is updated by gradient descent on this error, and every t_step time steps the target value network weight is synchronized, θ⁻ = θ. If the agent meets an end condition, such as reaching the maximum number of learning episodes or the market having reached an equilibrium state (the bidding strategies of all agent models no longer change), the simulation ends, and the final results (the final bidding strategy, profit, nodal price and other information of each agent model) are computed and output; otherwise, the agent model returns to the bid-submission step until the end condition is met.
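The replay-and-synchronize loop of step S104 can be sketched with a toy linear value function (a simplification, since the patent uses a multi-layer deep Q network; all sizes and rates below are assumed):

```python
import random
from collections import deque
import numpy as np

class MiniDQN:
    """Toy sketch: replay memory, target Y_j = r_j + gamma * max_a' Q(phi_{j+1}, a' | theta-),
    gradient descent on (Y_j - Q)^2, and synchronization theta- = theta every t_step steps.
    Q(phi, a) is linear here for brevity: Q = (theta @ phi)[a]."""

    def __init__(self, n_features, n_actions, gamma=0.95, lr=0.1,
                 memory_size=1000, t_step=10):
        self.theta = np.zeros((n_actions, n_features))   # current value network weights
        self.theta_target = self.theta.copy()            # target value network weights
        self.memory = deque(maxlen=memory_size)          # replay memory unit
        self.gamma, self.lr, self.t_step = gamma, lr, t_step
        self.steps = 0

    def store(self, phi_t, a_t, r_t, phi_next):
        self.memory.append((phi_t, a_t, r_t, phi_next))

    def train_step(self, batch_size=8):
        batch = random.sample(list(self.memory), min(batch_size, len(self.memory)))
        for phi_j, a_j, r_j, phi_j1 in batch:
            y_j = r_j + self.gamma * np.max(self.theta_target @ phi_j1)
            q_j = self.theta[a_j] @ phi_j
            # gradient of (Y_j - Q)^2 w.r.t. theta[a_j] is -2 * (Y_j - Q) * phi_j
            self.theta[a_j] += self.lr * 2 * (y_j - q_j) * phi_j
        self.steps += 1
        if self.steps % self.t_step == 0:                # synchronize theta- = theta
            self.theta_target = self.theta.copy()
```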
In a certain embodiment of the present invention, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the linear and quadratic coefficients of the unit's cost function, and P is the output of the unit.
Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], an interval divided into H equal parts, where A_min and A_max are the minimum and maximum selectable coefficients. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
In an embodiment of the present invention, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
In one embodiment of the present invention, the agent model submits its bid and the market operator performs the market clearing calculation based on the submitted bids, specifically: the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ); once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; based on the submitted bids, the market load, the grid topology and the market rules, and taking minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, the market operator computes the optimal power flow and gives the relevant market clearing information.
In an embodiment of the present invention, the training of the synchronous value network by the agent model specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rt,φt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rj,φj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention provides a deep Q network-based generator bidding behavior simulation method, comprising: constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Second embodiment of the invention:
Referring to fig. 4, a generator bidding behavior simulation system 200 based on a deep Q network according to an embodiment of the present invention includes: a construction unit 201, an agent model processing unit 202, a declaration bidding unit 203 and a training unit 204; wherein:
the construction unit 201 is configured to construct a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
the agent model processing unit 202 sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the declaration bidding unit 203 is configured for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the training unit 204 is configured to train and synchronize the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
In a certain embodiment of the present invention, the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit.

Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order; A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
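The quotation formulas above can be checked with a short sketch; the numeric values used in the comment are illustrative only, not taken from the patent.

```python
# Marginal cost C_M(P) = a + 2bP, coefficient A_i = A_min + (i/H)(A_max - A_min),
# and quoted price C_B = C_M * A_i. All parameter values here are illustrative.
def marginal_cost(a, b, p):
    return a + 2 * b * p

def bid_price(a, b, p, i, h, a_min, a_max):
    a_i = a_min + i / h * (a_max - a_min)  # coefficient of the i-th action
    return marginal_cost(a, b, p) * a_i

# e.g. a=10, b=0.5, P=100 gives C_M = 110; with A in [1.0, 2.0], H=10, i=5,
# the coefficient is A_i = 1.5 and the quoted price is 165.
```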
In an embodiment of the present invention, initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
In an embodiment of the present invention, the agent model trains and synchronizes the value network specifically as follows: according to the reward r_t and the next market environment state sequence s_{t+1}, obtain φ_{t+1} = φ(s_{t+1}) by max-min normalization, and store the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; the current value network weight parameter θ is updated by gradient descent according to the error function, and every t_step time steps the target value network weight is synchronized as θ⁻ = θ; if the agent model meets the end condition, the simulation ends, and the final result is calculated and output; if the agent model does not meet the end condition, the process returns to the step in which the agent model declares bids.
The generator bidding behavior simulation system 200 based on the deep Q network provided by the embodiment of the invention comprises: a construction unit 201, an agent model processing unit 202, a declaration bidding unit 203 and a training unit 204. The construction unit 201 is configured to construct a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit. The agent model processing unit 202 sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network. The declaration bidding unit 203 is configured for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids. The training unit 204 is configured to train and synchronize the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state. The system can solve the problem that traditional RL algorithms have difficulty effectively simulating the actual bidding behavior of generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Third embodiment of the invention:
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
The above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A generator bidding behavior simulation method based on a deep Q network is characterized by comprising the following steps:
constructing a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
setting agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the agent model declares bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
2. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit;

each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order, A_min and A_max being the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
3. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
4. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the agent model declares bids and the market operating mechanism performs market clearing calculation according to the declared bids, specifically: adopting an ε-greedy exploration mode, i.e., randomly selecting action a_t with probability ε, and otherwise selecting action a_t = argmax_a Q(φ_t, a | θ); after action a_t is determined, the corresponding quotation strategy is calculated according to the formula C_B = C_M·A_i and reported to the market operating mechanism; based on the quotation information, market load, power grid topology and market rules, the market operating mechanism takes minimizing the power generation cost (under one-sided quotation) or maximizing social welfare (under two-sided quotation) as the clearing objective, calculates the optimal power flow, and gives the related market clearing information.
5. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the agent model trains and synchronizes the value network, specifically: according to the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; updating the current value network weight parameter θ by gradient descent according to the error function, and synchronizing the target value network weight θ⁻ = θ every t_step time steps; if the agent model meets the end condition, ending the simulation, and calculating and outputting the final result; if the agent model does not meet the end condition, returning to the step in which the agent model declares bids.
6. A generator bidding behavior simulation system based on a deep Q network, characterized by comprising: a construction unit, an agent model processing unit, a declaration bidding unit and a training unit; wherein:
the construction unit is used for constructing a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
the agent model processing unit sets the agent model parameters and performs initialization processing on the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the declaration bidding unit is used for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the training unit is used for training and synchronizing the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
7. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit;

each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order, A_min and A_max being the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
8. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
9. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein the agent model trains and synchronizes the value network, specifically: according to the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; updating the current value network weight parameter θ by gradient descent according to the error function, and synchronizing the target value network weight θ⁻ = θ every t_step time steps; if the agent model meets the end condition, ending the simulation, and calculating and outputting the final result; if the agent model does not meet the end condition, returning to the step in which the agent model declares bids.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the deep Q network-based generator bid behavior simulation method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836213.3A CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836213.3A CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112036936A true CN112036936A (en) | 2020-12-04 |
Family
ID=73576884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836213.3A Pending CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112036936A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI779732B (en) * | 2021-07-21 | 2022-10-01 | 國立清華大學 | Method for renewable energy bidding using multiagent transfer reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tian et al. | Data driven parallel prediction of building energy consumption using generative adversarial nets | |
Kumar et al. | A hybrid multi-agent based particle swarm optimization algorithm for economic power dispatch | |
McCabe et al. | Optimizing the shape of a surge-and-pitch wave energy collector using a genetic algorithm | |
Mocanu et al. | Unsupervised energy prediction in a Smart Grid context using reinforcement cross-building transfer learning | |
Wang et al. | An evolutionary game approach to analyzing bidding strategies in electricity markets with elastic demand | |
CN108962238A (en) | Dialogue method, system, equipment and storage medium based on structural neural networks | |
Garg et al. | Symbolic network: generalized neural policies for relational MDPs | |
Ciomek et al. | Heuristics for prioritizing pair-wise elicitation questions with additive multi-attribute value models | |
CN113132232B (en) | Energy route optimization method | |
Li et al. | A hybrid deep interval prediction model for wind speed forecasting | |
CN116207739B (en) | Optimal scheduling method and device for power distribution network, computer equipment and storage medium | |
CN113255890A (en) | Reinforced learning intelligent agent training method based on PPO algorithm | |
CN116345578A (en) | Micro-grid operation optimization scheduling method based on depth deterministic strategy gradient | |
CN112036936A (en) | Deep Q network-based generator bidding behavior simulation method and system | |
Goh et al. | Hybrid SDS and WPT-IBBO-DNM Based Model for Ultra-short Term Photovoltaic Prediction | |
CN114239675A (en) | Knowledge graph complementing method for fusing multi-mode content | |
CN112580868A (en) | Power system transmission blocking management method, system, equipment and storage medium | |
CN115329985B (en) | Unmanned cluster intelligent model training method and device and electronic equipment | |
CN116992151A (en) | Online course recommendation method based on double-tower graph convolution neural network | |
Correia | Games with incomplete and asymmetric information in poolco markets | |
CN115577647A (en) | Power grid fault type identification method and intelligent agent construction method | |
Yang et al. | GNP-Sarsa with subroutines for trading rules on stock markets | |
Mustafa et al. | An application of genetic algorithm and least squares support vector machine for tracing the transmission loss in deregulated power system | |
Mustafa et al. | Transmission loss allocation in deregulated power system using the hybrid genetic algorithm-support vector machine technique | |
Di Camillo et al. | SimBioNeT: a simulator of biological network topology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20201204 |