CN112036936A - Deep Q network-based generator bidding behavior simulation method and system - Google Patents

Deep Q network-based generator bidding behavior simulation method and system

Info

Publication number
CN112036936A
Authority
CN
China
Prior art keywords
network
market
bidding
max
value network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010836213.3A
Other languages
Chinese (zh)
Inventor
张翔
尚楠
黄国日
陈政
辜炜德
宋艺航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Energy Development Research Institute of China Southern Power Grid Co Ltd
Original Assignee
Energy Development Research Institute of China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Energy Development Research Institute of China Southern Power Grid Co Ltd filed Critical Energy Development Research Institute of China Southern Power Grid Co Ltd
Priority to CN202010836213.3A
Publication of CN112036936A
Pending legal-status: Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0206Price or cost determination based on market factors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207Discounts or incentives, e.g. coupons or rebates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0611Request for offers or quotes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Primary Health Care (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses a method and a system for simulating the bidding behavior of power generators based on a deep Q network. The method comprises the following steps: constructing a state space S, an action space A and a reward function; setting the agent model parameters and initializing the agent model, wherein the parameters include the action space parameters A_min, A_max and H, the state space dimension, the exploration probability, and the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the agent model declares its bid, and the market operator performs the market clearing calculation according to the declared bids; the agent model trains and synchronizes the value networks until an end condition is met, where the end condition includes reaching the maximum number of learning iterations or the market reaching an equilibrium state. The method can solve the problem that traditional RL algorithms have difficulty effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the bidding behavior of the agent model.

Description

Deep Q network-based generator bidding behavior simulation method and system
Technical Field
The invention relates to the technical field of power markets, in particular to a method and a system for simulating bidding behaviors of a power generator based on a deep Q network.
Background
The power market simulation technology mainly comprises two methods: experimental economics and agent-based computational economics. Experimental economics simulates the bidding behavior of generation members in a real market through the decisions and performance of human subjects under a carefully designed experiment; however, it is limited by the number of participants and their level of market knowledge, the experimental results are highly random, and their relationship to the overall market still needs to be demonstrated. The computational economics approach embeds intelligent agent models into the power market simulation framework and makes bidding decisions with artificial intelligence methods. By comparison, the agent-based computational economics approach is therefore favored by researchers. The generator agent model is both the basis and the main difficulty of market simulation based on computational economics: its output not only influences the clearing result of the market simulation, but its rationality also determines the rationality of the dynamic market simulation results.
At present, agent-based generator bidding simulation algorithms at home and abroad have achieved certain research results, but most of them focus on traditional Reinforcement Learning (RL) algorithms; for example, the following agent models: (1) a generator agent model is established based on a generative adversarial network (GAN), and its bidding behavior is mined from historical and simulated data, but the model ignores the decision-making capability of an actual generator;
(2) a multi-input decision-factor agent model of the power generator is established to simulate the dynamic behavior evolution of generators under changing load demand; however, the adopted VRE algorithm is a learning model for a one-dimensional environment, so the effectiveness of the model in observing and perceiving the market environment still needs to be verified;
(3) a generator bidding decision module based on the Q-learning algorithm has been developed; Q-learning has strong exploration capability, but, limited by the traditional Q-learning algorithm, the model can only handle discrete, low-dimensional market environments, and situations such as line congestion and historical winning bids are not considered in decision making.
However, the actual power market is a complex system, and the bidding behavior of a generator is better described as a large-scale, continuous-space Markov Decision Process (MDP), so traditional RL algorithms have difficulty effectively simulating the actual bidding behavior of generators.
Disclosure of Invention
The purpose of the invention is to provide a method and a system for simulating the bidding behavior of power generators based on a deep Q network, which can solve the problem that traditional RL algorithms have difficulty effectively simulating the actual bidding behavior of power generators, and improve the accuracy and rationality of the bidding behavior of the agent model.
In order to achieve the purpose, the invention provides a generator bidding behavior simulation method based on a deep Q network, which comprises the following steps:
constructing a state space S, an action space A and a reward function; the state space S uses the nodal electricity price at the current time, the highest winning-bid segment at the current time and the congestion status of connected lines at the current time as state features; the action space A is constructed based on the marginal cost curve; the reward function is obtained from the generation profit;
setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning iterations; and the synchronization frequency t_step between the current value network and the target value network;
the agent model declares its bid, and the market operator performs the market clearing calculation according to the declared bids;
the agent model trains and synchronizes the value networks until an end condition is met; the end condition includes: reaching the maximum number of learning iterations or the market reaching an equilibrium state.
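By way of non-limiting illustration only, the overall simulation loop formed by the above steps can be sketched as follows; all class and function names (agent objects with declare_bid() and observe_and_train() methods, a market object with clear() and is_in_equilibrium() methods) are assumptions introduced for the example and are not prescribed by the invention.

```python
# Minimal sketch of the overall simulation loop, assuming agent and market
# objects with the illustrative methods named in the comments below.

def run_simulation(agents, market, max_episodes, t_step):
    """agents: generator agent models; market: market operator model."""
    for episode in range(max_episodes):
        # 1. Each agent model declares a bid from its current policy.
        bids = [agent.declare_bid() for agent in agents]
        # 2. The market operator clears the market from the declared bids.
        clearing = market.clear(bids)
        # 3. Each agent observes its reward (generation profit) and the next
        #    market state, stores the transition and trains its value networks,
        #    synchronizing the target network every t_step steps.
        for agent in agents:
            agent.observe_and_train(clearing, sync_every=t_step)
        # 4. End condition: maximum number of iterations or market equilibrium.
        if market.is_in_equilibrium(bids):
            break
```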
Further, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output.
Each action multiplies the marginal cost by a coefficient. The interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
further, the initializing the proxy model specifically includes: the method specifically comprises the following steps: initializing a market environment state sequence as s according to the selected state features in the state space1And obtaining phi after max-min normalization pretreatment1=φ(s1) (ii) a Initializing a current value network weight parameter theta and enabling a target value network weight parameter theta-=θ。
Further, the agent model performs declaration bidding, and the market operating mechanism performs market clearing calculation according to the declaration bidding, specifically: choose-greedy exploration mode, i.e. randomly choosing action a with probabilitytOtherwise, select action at=argmaxaQ(φtA | θ); action atAfter determination, according to formula CB=CMAiComputingObtaining a corresponding quotation strategy and reporting the quotation strategy to a market operator organization; the market operation mechanism calculates the optimal trend and provides relevant market clearing information by taking the minimized power generation cost during single-side quotation or the maximized social welfare during double-side quotation as clearing targets based on the quotation information, the market load, the power grid topological structure and the market rules of the market.
Further, the agent model training synchronization value network specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rtt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rjj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention also provides a generator bidding behavior simulation system based on a deep Q network, which comprises: a construction unit for constructing a state space S, an action space A and a reward function, wherein the state space S uses the nodal electricity price at the current time, the highest winning-bid segment at the current time and the congestion status of connected lines at the current time as state features, the action space A is constructed based on the marginal cost curve, and the reward function is obtained from the generation profit;
an agent model processing unit for setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning iterations; and the synchronization frequency t_step between the current value network and the target value network;
a bid declaration unit for the agent model to declare its bid, with the market operator performing the market clearing calculation according to the declared bids;
a training unit for the agent model to train and synchronize the value networks until an end condition is met; the end condition includes: reaching the maximum number of learning iterations or the market reaching an equilibrium state.
Further, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output.
Each action multiplies the marginal cost by a coefficient. The interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
further, the initializing the proxy model specifically includes: the method specifically comprises the following steps: initializing a market environment state sequence as s according to the selected state features in the state space1And obtaining phi after max-min normalization pretreatment1=φ(s1) (ii) a Initializing a current value network weight parameter theta and enabling a target value network weight parameter theta-=θ。
Further, the agent model training synchronization value network specifically includes: according to the rewardFunction rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rtt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rjj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
Compared with the prior art, the method and the system for simulating the bidding behavior of the power generator based on the deep Q network have the advantages that:
the invention discloses a method and a system for simulating bidding behaviors of a power generator based on a deep Q network, wherein the method comprises the following steps: constructing a state space S, an action space A and a reward function; the state space S selects the power price of a time node, a time highest middle mark section and a time blocking condition of a connected line as state characteristics; the action space A is constructed based on a marginal cost curve; the reward function is obtained according to the power generation profit; setting proxy model parameters and initializing the proxy model; wherein the parameters include: motion space parameter Amin、AmaxH; dimension of state spaceCounting; exploring the probability; the number of structural layers of the current value network and the target value network, the number of neurons in each layer, an activation function and optimizer parameters; playback memory unit capacity; maximum learning times, current value network and target value network synchronization frequency tstep(ii) a The agent model carries out declaration bidding, and a market operating mechanism carries out market clearing calculation according to the declaration bidding; the agent model trains a synchronous value network until an ending condition is met; wherein the end condition includes: the maximum number of learning is reached or the market has reached a state of equilibrium. The method can solve the problem that the traditional RL algorithm is difficult to effectively simulate the actual bidding behavior of the power generator, and improve the accuracy and the rationality of the bidding behavior of the agent model.
Drawings
Fig. 1 is a schematic flow chart of a generator bidding behavior simulation method based on a deep Q network according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a market clearing process in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of a training and synchronization value network in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 4 is a schematic structural diagram of a generator bidding behavior simulation system based on a deep Q network according to a first embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
The first embodiment of the present invention:
referring to fig. 1 to fig. 3, a generator bidding behavior simulation method based on a deep Q network according to an embodiment of the present invention includes at least the following steps:
s101, constructing a state space S, an action space A and a reward function; the state space S selects the power price of a time node, a time highest middle mark section and a time blocking condition of a connected line as state characteristics; the action space A is constructed based on a marginal cost curve; the reward function is obtained according to the power generation profit;
for step S101, when the state space S is constructed, the selection of the environmental state features determines whether the agent model can effectively and correctly observe and perceive the market environment, which is an important basis for decision making. Therefore, the invention selects the electricity price of the time node, the section with the highest time middle mark and the time blocking condition of the connected line as the state characteristics.
For step S101, in order to capture the characteristic that a generator's bid price increases monotonically, the action space of the agent model is constructed based on the marginal cost curve, and the marginal cost C_M(P) is calculated as follows:
C_M(P) = a + 2bP (1)
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output.
Each action multiplies the marginal cost by a coefficient. The interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
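A minimal sketch of the mapping from a discrete action index to a bid price implied by the formulas above (variable names are illustrative assumptions):

```python
def action_to_bid(i, a_min, a_max, h, a_cost, b_cost, p_output):
    """Map the i-th discrete action to a bid price.

    i               : selected action index, 0 <= i <= h
    a_min, a_max, h : action space parameters A_min, A_max and H
    a_cost, b_cost  : cost function coefficients a and b
    p_output        : unit output P
    """
    marginal_cost = a_cost + 2.0 * b_cost * p_output   # C_M(P) = a + 2bP
    coefficient = a_min + i / h * (a_max - a_min)       # A_i
    return marginal_cost * coefficient                  # C_B = C_M * A_i
```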
for step S101, the reward function (i.e. the enhanced signal) is related to the subordinate objective of the power generator in making the bidding decision, and can be obtained directly or indirectly according to the market environment and the market result, such as the bid-winning price, the bid-winning amount, the power generation profit, and the power generation profit. For simplicity, the generating profit is mainly selected as the reward function of the agent model.
S102, setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning iterations; and the synchronization frequency t_step between the current value network and the target value network;
The initialization in step S102 mainly concerns the market environment state and the value network weight parameters. The market environment state sequence s_1 is initialized according to the state features selected for the state space S, and φ_1 = φ(s_1) is obtained after max-min normalization preprocessing. The current value network weight parameters θ are initialized and the target value network weight parameters are set to θ⁻ = θ.
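A minimal sketch of this initialization step is given below, assuming a PyTorch implementation (the invention does not mandate a specific framework; the layer sizes, dimensions and optimizer settings are illustrative assumptions):

```python
import torch
import torch.nn as nn

def build_q_network(state_dim, n_actions, hidden=64):
    # assumed fully connected value network with two hidden ReLU layers
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )

state_dim, n_actions = 4, 11                            # illustrative dimensions
current_net = build_q_network(state_dim, n_actions)     # current value network, weights θ
target_net = build_q_network(state_dim, n_actions)      # target value network, weights θ⁻
target_net.load_state_dict(current_net.state_dict())    # set θ⁻ = θ
optimizer = torch.optim.Adam(current_net.parameters(), lr=1e-3)
```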
S103, the agent model declares its bid, and the market operator performs the market clearing calculation according to the declared bids;
for step S103, the agent model mainly selects a greedy exploration mode, that is, randomly selects an action a with a probabilitytOtherwise, select action at=argmaxaQ(φtAnd a | θ). Action atAfter determination, according to CB=CMAiCalculating to obtain a corresponding quotation strategy and reporting to a market operator organization; and the market operator organization carries out clearing calculation according to the quotation strategies and the market rules submitted by the agent models and publishes clearing information. The market clearing calculation is based on quoted price information, market load, a power grid topological structure and market rules of the market, the electricity generation cost minimization during single-side quoted price or the social welfare maximization during double-side quoted price is taken as a clearing target, the optimal power flow is calculated, and relevant market clearing information is given, wherein the clearing model essentially solves the calculation problems of the unit combination problem and the node marginal electricity price;
s104, training a synchronous value network by the agent model until an ending condition is met; wherein the end condition includes: the maximum number of learning is reached or the market has reached a state of equilibrium.
For step S104, the agent model obtains the reward r_t and the next market environment state sequence s_{t+1} from the market clearing information. It then obtains φ_{t+1} = φ(s_{t+1}) by max-min normalization and stores the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory. The agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ·max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))². The current value network weights θ are updated by gradient descent according to the error function, and every t_step time steps the target value network weights are synchronized as θ⁻ = θ. If the agent satisfies an end condition, such as reaching the maximum number of learning iterations or the market having reached an equilibrium state (the bidding strategies of all agent models no longer change), the simulation ends and the final results (the final bidding strategy of each agent model, its profit, the nodal prices and other information) are computed and output; if not, the process returns to the bid declaration step until the end condition is met.
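A minimal sketch of this training and synchronization step, again assuming a PyTorch implementation; the replay memory capacity, batch handling and variable names are illustrative assumptions:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

replay_memory = deque(maxlen=10000)   # replay memory unit (assumed capacity)

def train_step(current_net, target_net, optimizer, batch_size, gamma,
               step_count, t_step):
    """Sample transitions (φ_j, a_j, r_j, φ_{j+1}), minimize (Y_j - Q(φ_j, a_j | θ))²
    with Y_j = r_j + γ·max_a' Q(φ_{j+1}, a' | θ⁻), and sync θ⁻ = θ every t_step steps."""
    if len(replay_memory) < batch_size:
        return
    batch = random.sample(replay_memory, batch_size)
    phi_j, a_j, r_j, phi_j1 = map(list, zip(*batch))
    phi_j = torch.as_tensor(phi_j, dtype=torch.float32)
    a_j = torch.as_tensor(a_j, dtype=torch.int64).unsqueeze(1)
    r_j = torch.as_tensor(r_j, dtype=torch.float32)
    phi_j1 = torch.as_tensor(phi_j1, dtype=torch.float32)

    with torch.no_grad():
        y_j = r_j + gamma * target_net(phi_j1).max(dim=1).values   # optimization target Y_j
    q_j = current_net(phi_j).gather(1, a_j).squeeze(1)             # Q(φ_j, a_j | θ)
    loss = F.mse_loss(q_j, y_j)                                    # error function

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step_count % t_step == 0:                                   # synchronize θ⁻ = θ
        target_net.load_state_dict(current_net.state_dict())
```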
In a certain embodiment of the present invention, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output.
Each action multiplies the marginal cost by a coefficient. The interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
in an embodiment of the present invention, the initializing the proxy model specifically includes: the method specifically comprises the following steps: initializing a market environment state sequence as s according to the selected state features in the state space1And obtaining phi after max-min normalization pretreatment1=φ(s1) (ii) a Initializing a current value network weight parameter theta and enabling a target value network weight parameter theta-=θ。
In one embodiment of the present invention, the agent model performs a bid for declaration, and the market operating mechanism performs a market clearing calculation according to the bid for declaration, specifically: choose-greedy exploration mode, i.e. randomly choosing action a with probabilitytOtherwise, select action at=argmaxaQ(φtA | θ); action atAfter determination, according to formula CB=CMAiCalculating to obtain a corresponding quotation strategy and reporting to a market operator organization; the market operating mechanism is based onThe method comprises the steps of calculating optimal tide and giving related market clearing information by taking the minimum power generation cost during single-side quotation or the maximum social welfare during double-side quotation as clearing targets according to quotation information, market load, a power grid topological structure and market rules of the market.
In an embodiment of the present invention, the training of the synchronous value network by the agent model specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rtt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rjj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention provides a generator bidding behavior simulation method based on a deep Q network, which comprises the following steps: constructing a state space S, an action space A and a reward function; the state space S selects the power price of a time node, a time highest middle mark section and a time blocking condition of a connected line as state characteristics; the action space A is constructed based on a marginal cost curve; the reward function is obtained according to the power generation profit; setting proxy model parameters and initializing the proxy model; wherein the parameters include: motion space parameter Amin、AmaxH; a state space dimension; exploring the probability; number of structural layers of current value network and target value network, number of neurons in each layerAnd activating functions, optimizer parameters; playback memory unit capacity; maximum learning times, current value network and target value network synchronization frequency tstep(ii) a The agent model carries out declaration bidding, and a market operating mechanism carries out market clearing calculation according to the declaration bidding; the agent model trains a synchronous value network until an ending condition is met; wherein the end condition includes: the maximum number of learning is reached or the market has reached a state of equilibrium. The method can solve the problem that the traditional RL algorithm is difficult to effectively simulate the actual bidding behavior of the power generator, and improve the accuracy and the rationality of the bidding behavior of the agent model.
Second embodiment of the invention:
referring to fig. 4, a generator bidding behavior simulation system 200 based on a deep Q network according to an embodiment of the present invention includes: the system comprises a construction unit 201, an agent model processing unit 202, a bid application unit 203 and a training unit 204; wherein the content of the first and second substances,
the constructing unit 201 is configured to construct a state space S, an action space a, and a reward function; the state space S selects the power price of a time node, a time highest middle mark section and a time blocking condition of a connected line as state characteristics; the action space A is constructed based on a marginal cost curve; the reward function is obtained according to the power generation profit;
the proxy model processing unit 202 sets proxy model parameters and initializes the proxy model; wherein the parameters include: motion space parameter Amin、AmaxH; a state space dimension; exploring the probability; the number of structural layers of the current value network and the target value network, the number of neurons in each layer, an activation function and optimizer parameters; playback memory unit capacity; maximum learning times, current value network and target value network synchronization frequency tstep
The declaration bidding unit 203 is configured to perform declaration bidding on the agent model, and a market operating mechanism performs market clearing calculation according to the declaration bidding;
the training unit 204 is configured to train a synchronous value network for the proxy model until an end condition is met; wherein the end condition includes: the maximum number of learning is reached or the market has reached a state of equilibrium.
In a certain embodiment of the present invention, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output.
Each action multiplies the marginal cost by a coefficient. The interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
in an embodiment of the present invention, the initializing the proxy model specifically includes: the method specifically comprises the following steps: initializing a market environment state sequence as s according to the selected state features in the state space1And obtaining phi after max-min normalization pretreatment1=φ(s1) (ii) a Initializing a current value network weight parameter theta and enabling a target value network weight parameter theta-=θ。
In an embodiment of the present invention, the training of the synchronous value network by the agent model specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rtt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rjj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculateError function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The generator bidding behavior simulation system 200 based on the deep Q network provided by the embodiment of the invention comprises: the system comprises a construction unit 201, an agent model processing unit 202, a bid application unit 203 and a training unit 204; the construction unit 201 is configured to construct a state space S, an action space a, and a reward function; the state space S selects the power price of a time node, a time highest middle mark section and a time blocking condition of a connected line as state characteristics; the action space A is constructed based on a marginal cost curve; the reward function is obtained according to the power generation profit; the proxy model processing unit 202 sets proxy model parameters and initializes the proxy model; wherein the parameters include: motion space parameter Amin、AmaxH; a state space dimension; exploring the probability; the number of structural layers of the current value network and the target value network, the number of neurons in each layer, an activation function and optimizer parameters; playback memory unit capacity; maximum learning times, current value network and target value network synchronization frequency tstep(ii) a The declaration bidding unit 203 is configured to perform declaration bidding on the agent model, and a market operating mechanism performs market clearing calculation according to the declaration bidding; the training unit 204 is configured to train a synchronous value network for the proxy model until an end condition is met; wherein the end condition includes: the maximum number of learning is reached or the market has reached a state of equilibrium. The system can solve the problem that the traditional RL algorithm is difficult to effectively simulate the actual bidding behavior of a generator, and improve the accuracy and the rationality of the bidding behavior of the agent model.
Third embodiment of the invention:
embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as the protection scope of the present invention.

Claims (10)

1. A generator bidding behavior simulation method based on a deep Q network is characterized by comprising the following steps:
constructing a state space S, an action space A and a reward function; the state space S uses the nodal electricity price at the current time, the highest winning-bid segment at the current time and the congestion status of connected lines at the current time as state features; the action space A is constructed based on the marginal cost curve; the reward function is obtained from the generation profit;
setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning iterations; and the synchronization frequency t_step between the current value network and the target value network;
the agent model declares its bid, and the market operator performs the market clearing calculation according to the declared bids;
the agent model trains and synchronizes the value networks until an end condition is met; the end condition includes: reaching the maximum number of learning iterations or the market reaching an equilibrium state.
2. The deep Q network-based power generator bidding behavior simulation method according to claim 1, wherein the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output;
each action multiplies the marginal cost by a coefficient; the interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
3. The deep Q network-based power generator bidding behavior simulation method according to claim 1, wherein initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameters θ and setting the target value network weight parameters θ⁻ = θ.
4. The deep Q network-based power generator bidding behavior simulation method according to claim 1, wherein the agent model declares its bid and the market operator performs the market clearing calculation according to the declared bids, specifically: the agent model uses the ε-greedy exploration strategy, i.e., with probability ε it randomly selects an action a_t, and otherwise selects the action a_t = argmax_a Q(φ_t, a | θ); once the action a_t is determined, the corresponding bidding strategy is computed according to the formula C_B = C_M · A_i and reported to the market operator; based on the submitted bid information, the market load, the grid topology and the market rules, the market operator takes minimization of generation cost (for single-sided bidding) or maximization of social welfare (for double-sided bidding) as the clearing objective, calculates the optimal power flow and gives the relevant market clearing information.
5. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the agent model trains and synchronizes the value networks, specifically: the agent obtains the reward r_t and the next market environment state sequence s_{t+1}, obtains φ_{t+1} = φ(s_{t+1}) by max-min normalization, and stores the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ·max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))²; the current value network weights θ are updated by gradient descent according to the error function, and every t_step time steps the target value network weights are synchronized as θ⁻ = θ; if the agent model satisfies the end condition, the simulation terminates and the final results are computed and output; if not, the process returns to the bid declaration step.
6. A generator bidding behavior simulation system based on a deep Q network, characterized by comprising: a construction unit, an agent model processing unit, a bid declaration unit and a training unit; wherein:
the construction unit is configured to construct a state space S, an action space A and a reward function; the state space S uses the nodal electricity price at the current time, the highest winning-bid segment at the current time and the congestion status of connected lines at the current time as state features; the action space A is constructed based on the marginal cost curve; the reward function is obtained from the generation profit;
the agent model processing unit is configured to set the agent model parameters and initialize the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning iterations; and the synchronization frequency t_step between the current value network and the target value network;
the bid declaration unit is configured for the agent model to declare its bid, with the market operator performing the market clearing calculation according to the declared bids;
the training unit is configured to train and synchronize the value networks of the agent model until an end condition is met; the end condition includes: reaching the maximum number of learning iterations or the market reaching an equilibrium state.
7. The deep Q network-based power generator bidding behavior simulation system according to claim 6, wherein the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the unit output;
each action multiplies the marginal cost by a coefficient; the interval [A_min, A_max] is divided into H equal parts in increasing order, where A_min and A_max are the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)·(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
8. The deep Q network-based power generator bidding behavior simulation system according to claim 6, wherein initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameters θ and setting the target value network weight parameters θ⁻ = θ.
9. The deep Q network-based power generator bidding behavior simulation system according to claim 6, wherein the agent model trains and synchronizes the value networks, specifically: the agent obtains the reward r_t and the next market environment state sequence s_{t+1}, obtains φ_{t+1} = φ(s_{t+1}) by max-min normalization, and stores the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ·max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))²; the current value network weights θ are updated by gradient descent according to the error function, and every t_step time steps the target value network weights are synchronized as θ⁻ = θ; if the agent model satisfies the end condition, the simulation terminates and the final results are computed and output; if not, the process returns to the bid declaration step.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
CN202010836213.3A 2020-08-19 2020-08-19 Deep Q network-based generator bidding behavior simulation method and system Pending CN112036936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010836213.3A CN112036936A (en) 2020-08-19 2020-08-19 Deep Q network-based generator bidding behavior simulation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010836213.3A CN112036936A (en) 2020-08-19 2020-08-19 Deep Q network-based generator bidding behavior simulation method and system

Publications (1)

Publication Number Publication Date
CN112036936A true CN112036936A (en) 2020-12-04

Family

ID=73576884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010836213.3A Pending CN112036936A (en) 2020-08-19 2020-08-19 Deep Q network-based generator bidding behavior simulation method and system

Country Status (1)

Country Link
CN (1) CN112036936A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI779732B (en) * 2021-07-21 2022-10-01 National Tsing Hua University Method for renewable energy bidding using multiagent transfer reinforcement learning

Similar Documents

Publication Publication Date Title
Tian et al. Data driven parallel prediction of building energy consumption using generative adversarial nets
Kumar et al. A hybrid multi-agent based particle swarm optimization algorithm for economic power dispatch
McCabe et al. Optimizing the shape of a surge-and-pitch wave energy collector using a genetic algorithm
Mocanu et al. Unsupervised energy prediction in a Smart Grid context using reinforcement cross-building transfer learning
Wang et al. An evolutionary game approach to analyzing bidding strategies in electricity markets with elastic demand
CN108962238A (en) Dialogue method, system, equipment and storage medium based on structural neural networks
Garg et al. Symbolic network: generalized neural policies for relational MDPs
Ciomek et al. Heuristics for prioritizing pair-wise elicitation questions with additive multi-attribute value models
CN113132232B (en) Energy route optimization method
Li et al. A hybrid deep interval prediction model for wind speed forecasting
CN116207739B (en) Optimal scheduling method and device for power distribution network, computer equipment and storage medium
CN113255890A (en) Reinforced learning intelligent agent training method based on PPO algorithm
CN116345578A (en) Micro-grid operation optimization scheduling method based on depth deterministic strategy gradient
CN112036936A (en) Deep Q network-based generator bidding behavior simulation method and system
Goh et al. Hybrid SDS and WPT-IBBO-DNM Based Model for Ultra-short Term Photovoltaic Prediction
CN114239675A (en) Knowledge graph complementing method for fusing multi-mode content
CN112580868A (en) Power system transmission blocking management method, system, equipment and storage medium
CN115329985B (en) Unmanned cluster intelligent model training method and device and electronic equipment
CN116992151A (en) Online course recommendation method based on double-tower graph convolution neural network
Correia Games with incomplete and asymmetric information in poolco markets
CN115577647A (en) Power grid fault type identification method and intelligent agent construction method
Yang et al. GNP-Sarsa with subroutines for trading rules on stock markets
Mustafa et al. An application of genetic algorithm and least squares support vector machine for tracing the transmission loss in deregulated power system
Mustafa et al. Transmission loss allocation in deregulated power system using the hybrid genetic algorithm-support vector machine technique
Di Camillo et al. SimBioNeT: a simulator of biological network topology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20201204