CN112036936A - Deep Q network-based generator bidding behavior simulation method and system - Google Patents
- Publication number
- CN112036936A (application number CN202010836213.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- market
- bidding
- max
- value network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0206—Price or cost determination based on market factors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0207—Discounts or incentives, e.g. coupons or rebates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0611—Request for offers or quotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/06—Electricity, gas or water supply
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Primary Health Care (AREA)
- Tourism & Hospitality (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention discloses a deep Q network-based method and system for simulating the bidding behavior of power generators, wherein the method comprises the following steps: constructing a state space S, an action space A and a reward function; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; and the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Description
Technical Field
The invention relates to the technical field of power markets, in particular to a method and a system for simulating bidding behaviors of a power generator based on a deep Q network.
Background
Power market simulation techniques mainly comprise two approaches: experimental economics and agent-based computational economics. Experimental economics simulates the bidding behavior of generation participants in the real market through the decisions and performance of human testers in well-designed experiments; however, it is limited by the number of participants and their level of market knowledge, its results are highly random, and their relationship to the overall market remains to be demonstrated. The computational economics approach embeds intelligent agent models into a power market simulation research framework and makes bidding decisions using artificial intelligence methods. By comparison, the agent-based computational economics approach is therefore favored by researchers. The generator agent model is both the foundation and the difficulty of computational-economics-based market simulation: its output not only influences the clearing results of the market simulation, but its rationality also determines the rationality of the dynamic market simulation results.
At present, agent-based generator bidding simulation algorithms at home and abroad have achieved certain research results, but most of this work focuses on traditional Reinforcement Learning (RL) algorithms. Consider, for example, the following agent models: (1) a generator agent model built on a generative adversarial network (GAN), which mines bidding behavior from historical and simulated data but ignores the decision-making capability of an actual generator;
(2) a multi-input decision-factor agent model of the generator, which simulates the dynamic behavior evolution of generators under changing load demand; however, the VRE algorithm it adopts is a learning model for a one-dimensional environment, so the model's effectiveness in observing and perceiving the market environment remains to be verified;
(3) a generator bidding decision module based on the Q-learning algorithm, which has strong exploration capability; however, limited by the traditional Q-learning algorithm, the model can only handle discrete, low-dimensional market environments, and conditions such as line congestion and historical winning bids are not considered during decision making.
However, the actual power market is a complex system, and the bidding behavior of a generator is better characterized as a large-scale, continuous-space Markov Decision Process (MDP), so traditional RL algorithms struggle to effectively simulate the actual bidding behavior of generators.
Disclosure of Invention
The purpose of the invention is to provide a deep Q network-based method and system for simulating generator bidding behavior, which overcome the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improve the accuracy and rationality of the agent model's bidding behavior.
In order to achieve the purpose, the invention provides a generator bidding behavior simulation method based on a deep Q network, which comprises the following steps:
constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids;
the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
Further, the marginal cost calculation formula is as follows:
CM(P)=a+2bP
in the formula, a and b are coefficients of a first term and a second term of the cost function respectively; p is the output of the unit;
each action is to multiply the marginal cost by a factor, A ∈ [ A ]min,Amax]Divided into increasing H equal parts, AminAnd AmaxThe minimum and maximum selectable coefficients, respectively. If the agent model selects the ith action, the corresponding coefficients are:
Ai=Amin+i/H*(Amax-Amin)
then its quoted price is:
CB=CMAi。
Further, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
Further, the agent model submits its bid and the market operator performs the market clearing calculation based on the submitted bids, specifically: the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ); once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; based on the submitted bids, the market load, the grid topology and the market rules, and taking minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, the market operator computes the optimal power flow and gives the relevant market clearing information.
Further, the agent model training synchronization value network specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rt,φt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rj,φj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention also provides a deep Q network-based generator bidding behavior simulation system, which comprises: a construction unit for constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
an agent model processing unit that sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
a bid submission unit for the agent model to submit its bid, with the market operator performing the market clearing calculation based on the submitted bids;
a training unit for the agent model to train and synchronize the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
Further, the marginal cost calculation formula is as follows:
CM(P)=a+2bP
in the formula, a and b are coefficients of a first term and a second term of the cost function respectively; p is the output of the unit;
each action is to multiply the marginal cost by a factor, A ∈ [ A ]min,Amax]Divided into increasing H equal parts, AminAnd AmaxThe minimum and maximum selectable coefficients, respectively. If the agent model selects the ith action, the corresponding coefficients are:
Ai=Amin+i/H*(Amax-Amin)
then its quoted price is:
CB=CMAi。
Further, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
Further, the agent model training and synchronizing the value network specifically comprises: obtaining the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition (φ_t, a_t, r_t, φ_{t+1}) in the replay memory; the agent model randomly samples a fixed number of transitions (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))²; the current value network weight parameter θ is updated by gradient descent on this error, and every t_step time steps the target value network weight is synchronized, θ⁻ = θ; if the agent model meets the end condition, the simulation ends and the final result is computed and output; if the agent model does not meet the end condition, it returns to the bid-submission step.
Embodiments of the present invention also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
Compared with the prior art, the deep Q network-based method and system for simulating generator bidding behavior of the invention have the following advantages:
The invention discloses a deep Q network-based method and system for simulating the bidding behavior of power generators, wherein the method comprises: constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Drawings
Fig. 1 is a schematic flow chart of a generator bidding behavior simulation method based on a deep Q network according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of a market clearing process in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of a training and synchronization value network in a deep Q network-based generator bidding behavior simulation method according to a first embodiment of the present invention;
fig. 4 is a schematic structural diagram of a generator bidding behavior simulation system based on a deep Q network according to a first embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
The first embodiment of the present invention:
referring to fig. 1 to fig. 3, a generator bidding behavior simulation method based on a deep Q network according to an embodiment of the present invention includes at least the following steps:
S101, constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit;
For step S101, when the state space S is constructed, the selection of environmental state features determines whether the agent model can effectively and correctly observe and perceive the market environment, which is an important basis for decision making. Therefore, the invention selects the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features.
For step S101, to facilitate simulating the monotonically increasing characteristic of a generator's bid curve when constructing the action space A, this patent constructs the action space of the agent model based on the marginal cost curve, and the marginal cost C_M(P) is calculated as follows:
C_M(P) = a + 2bP (1)
where a and b are the linear and quadratic coefficients of the unit's cost function, and P is the unit output.
Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], an interval divided into H equal parts, where A_min and A_max are the minimum and maximum selectable coefficients. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
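As an illustrative aid (not part of the patent), the action-to-bid mapping above can be sketched in Python; the cost coefficients a, b and the action-space parameters A_min, A_max, H below are assumed example values:

```python
def marginal_cost(P, a=20.0, b=0.005):
    """Marginal cost C_M(P) = a + 2bP of a unit with cost function a*P + b*P^2."""
    return a + 2 * b * P

def bid_price(i, P, A_min=1.0, A_max=2.0, H=10, a=20.0, b=0.005):
    """Bid C_B = C_M * A_i for the i-th action, A_i = A_min + (i/H)(A_max - A_min)."""
    A_i = A_min + i / H * (A_max - A_min)
    return marginal_cost(P, a, b) * A_i

# i = 0 bids exactly at marginal cost; i = H bids at A_max times marginal cost.
```

For P = 300 with these assumed coefficients, C_M = 23 and the selectable bids range from 23 (i = 0) to 46 (i = H).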
For step S101, the reward function (i.e., the reinforcement signal) relates to the objective the generator pursues when making bidding decisions, and can be obtained directly or indirectly from the market environment and market results, such as the winning bid price, the winning bid quantity, the generation revenue, or the generation profit. For simplicity, the generation profit is selected as the reward function of the agent model.
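A minimal sketch of this reward choice, assuming the quadratic cost function aP + bP² implied by marginal-cost formula (1) and illustrative coefficient values:

```python
def generation_profit(clearing_price, quantity, a=20.0, b=0.005):
    """Reward r_t: cleared revenue minus generation cost a*P + b*P^2 (assumed form)."""
    cost = a * quantity + b * quantity ** 2
    return clearing_price * quantity - cost
```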
S102, setting the agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network;
The objects of the initialization process in step S102 are mainly the market environment state and the value network weight parameters. The market environment state sequence s_1 is initialized according to the state features selected for the state space S and the action space A, and φ_1 = φ(s_1) is obtained after max-min normalization preprocessing. The current value network weight parameter θ is initialized, and the target value network weight parameter is set to θ⁻ = θ.
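The max-min normalization preprocessing φ(s) can be sketched as follows; the three-feature example state and its bounds are hypothetical, standing in for the nodal price, highest winning-bid segment and tie-line congestion features:

```python
import numpy as np

def max_min_normalize(s, s_min, s_max):
    """phi(s): scale each state feature into [0, 1] using per-feature bounds."""
    s, s_min, s_max = (np.asarray(x, dtype=float) for x in (s, s_min, s_max))
    return (s - s_min) / (s_max - s_min)

s1 = [350.0, 4, 1]                                     # hypothetical raw state s_1
phi1 = max_min_normalize(s1, [200.0, 0, 0], [600.0, 8, 1])
```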
S103, the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids;
For step S103, the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ). Once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; the market operator performs the clearing calculation according to the bid strategies submitted by the agent models and the market rules, and publishes the clearing information. The market clearing calculation, based on the submitted bids, the market load, the grid topology and the market rules, takes minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, computes the optimal power flow, and gives the relevant market clearing information; the clearing model essentially solves the unit commitment problem and the calculation of locational marginal prices.
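The ε-greedy selection described above can be sketched as follows (illustrative only, with Q-values supplied as a plain list):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action a_t; otherwise argmax_a Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```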
S104, the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state.
For step S104, the agent model obtains the reward r_t and the next market environment state sequence s_{t+1} from the market clearing information. Meanwhile, φ_{t+1} = φ(s_{t+1}) is obtained by max-min normalization, and the transition (φ_t, a_t, r_t, φ_{t+1}) is stored in the replay memory. The agent model randomly samples a fixed number of transitions (φ_j, a_j, r_j, φ_{j+1}) from the replay memory, computes the optimization target Y_j = r_j + γ max_{a'} Q(φ_{j+1}, a' | θ⁻) from the target network, and computes the error function (Y_j - Q(φ_j, a_j | θ))². The current value network weight parameter θ is updated by gradient descent on this error, and every t_step time steps the target value network weight is synchronized, θ⁻ = θ. If the agent meets an end condition, such as reaching the maximum number of learning episodes or the market having reached an equilibrium state (the bidding strategies of all agent models no longer change), the simulation ends, and the final results (the final bidding strategy, profit, nodal price and other information of each agent model) are computed and output; otherwise, the agent model returns to the bid-submission step until the end condition is met.
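The replay-and-synchronize loop of step S104 can be sketched with a toy linear value function (a simplification, since the patent uses a multi-layer deep Q network; all sizes and rates below are assumed):

```python
import random
from collections import deque
import numpy as np

class MiniDQN:
    """Toy sketch: replay memory, target Y_j = r_j + gamma * max_a' Q(phi_{j+1}, a' | theta-),
    gradient descent on (Y_j - Q)^2, and synchronization theta- = theta every t_step steps.
    Q(phi, a) is linear here for brevity: Q = (theta @ phi)[a]."""

    def __init__(self, n_features, n_actions, gamma=0.95, lr=0.1,
                 memory_size=1000, t_step=10):
        self.theta = np.zeros((n_actions, n_features))   # current value network weights
        self.theta_target = self.theta.copy()            # target value network weights
        self.memory = deque(maxlen=memory_size)          # replay memory unit
        self.gamma, self.lr, self.t_step = gamma, lr, t_step
        self.steps = 0

    def store(self, phi_t, a_t, r_t, phi_next):
        self.memory.append((phi_t, a_t, r_t, phi_next))

    def train_step(self, batch_size=8):
        batch = random.sample(list(self.memory), min(batch_size, len(self.memory)))
        for phi_j, a_j, r_j, phi_j1 in batch:
            y_j = r_j + self.gamma * np.max(self.theta_target @ phi_j1)
            q_j = self.theta[a_j] @ phi_j
            # gradient of (Y_j - Q)^2 w.r.t. theta[a_j] is -2 * (Y_j - Q) * phi_j
            self.theta[a_j] += self.lr * 2 * (y_j - q_j) * phi_j
        self.steps += 1
        if self.steps % self.t_step == 0:                # synchronize theta- = theta
            self.theta_target = self.theta.copy()
```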
In a certain embodiment of the present invention, the marginal cost is calculated as follows:
C_M(P) = a + 2bP
where a and b are the linear and quadratic coefficients of the unit's cost function, and P is the output of the unit.
Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], an interval divided into H equal parts, where A_min and A_max are the minimum and maximum selectable coefficients. If the agent model selects the i-th action, the corresponding coefficient is:
A_i = A_min + (i/H)(A_max - A_min)
and the corresponding bid price is:
C_B = C_M · A_i
In an embodiment of the present invention, initializing the agent model specifically comprises: initializing the market environment state sequence s_1 according to the state features selected for the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
In one embodiment of the present invention, the agent model submits its bid and the market operator performs the market clearing calculation based on the submitted bids, specifically: the agent model adopts the ε-greedy exploration mode, i.e., with probability ε it randomly selects an action a_t, and otherwise selects a_t = argmax_a Q(φ_t, a | θ); once a_t is determined, the corresponding bid strategy is computed from C_B = C_M · A_i and submitted to the market operator; based on the submitted bids, the market load, the grid topology and the market rules, and taking minimum generation cost under one-sided bidding or maximum social welfare under two-sided bidding as the clearing objective, the market operator computes the optimal power flow and gives the relevant market clearing information.
In an embodiment of the present invention, the training of the synchronous value network by the agent model specifically includes: according to the reward function rtAnd the next market environment state sequence st+1Simultaneously obtaining phi by max-min normalization processingt+1=φ(st+1) And storing the transfer sequence (phi)t,at,rt,φt+1) To a playback memory unit; the proxy model randomly samples a fixed number of transfer samples (phi) from the memory cellsj,aj,rj,φj+1) Calculating an optimization objective Y from the objective networkj=rj+γmaxa'Q(φj+1,a'|θ-) And calculating an error function (Y)j-Q(φj,aj|θ))2(ii) a Updating the current value of the network weight parameter theta by using a gradient descent method according to the error function, and simultaneously, updating the current value of the network weight parameter theta every tstepTime step synchronization target value network weight theta-θ; if the proxy model meets the end condition, ending the simulation, and calculating and outputting a final result; and if the agent model does not meet the end condition, returning to execute the agent model to declare a bid.
The embodiment of the invention provides a deep Q network-based generator bidding behavior simulation method, comprising: constructing a state space S, an action space A and a reward function; the state space S takes the nodal electricity price of the period, the highest winning-bid segment of the period, and the congestion status of connected tie lines during the period as state features; the action space A is constructed based on the marginal cost curve; the reward function is derived from the generation profit; setting the agent model parameters and initializing the agent model, wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation function and optimizer parameters of the current value network and the target value network; the replay memory capacity; the maximum number of learning episodes; and the synchronization frequency t_step between the current value network and the target value network; the agent model submits its bid, and the market operator performs the market clearing calculation based on the submitted bids; the agent model trains and synchronizes the value network until an end condition is met, wherein the end condition includes: the maximum number of learning episodes is reached or the market has reached an equilibrium state. The method overcomes the difficulty traditional RL algorithms have in effectively simulating the actual bidding behavior of power generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Second embodiment of the invention:
Referring to fig. 4, a generator bidding behavior simulation system 200 based on a deep Q network according to an embodiment of the present invention includes: a construction unit 201, an agent model processing unit 202, a declaration bidding unit 203 and a training unit 204; wherein:
the construction unit 201 is configured to construct a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
the agent model processing unit 202 sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the declaration bidding unit 203 is configured for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the training unit 204 is configured to train and synchronize the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
In a certain embodiment of the present invention, the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit.

Each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order; A_min and A_max are the minimum and maximum selectable coefficients, respectively. If the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
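The quotation formulas above can be checked with a short sketch; the numeric values used in the comment are illustrative only, not taken from the patent.

```python
# Marginal cost C_M(P) = a + 2bP, coefficient A_i = A_min + (i/H)(A_max - A_min),
# and quoted price C_B = C_M * A_i. All parameter values here are illustrative.
def marginal_cost(a, b, p):
    return a + 2 * b * p

def bid_price(a, b, p, i, h, a_min, a_max):
    a_i = a_min + i / h * (a_max - a_min)  # coefficient of the i-th action
    return marginal_cost(a, b, p) * a_i

# e.g. a=10, b=0.5, P=100 gives C_M = 110; with A in [1.0, 2.0], H=10, i=5,
# the coefficient is A_i = 1.5 and the quoted price is 165.
```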
In an embodiment of the present invention, initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
In an embodiment of the present invention, the agent model trains and synchronizes the value network specifically as follows: according to the reward r_t and the next market environment state sequence s_{t+1}, obtain φ_{t+1} = φ(s_{t+1}) by max-min normalization, and store the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; the current value network weight parameter θ is updated by gradient descent according to the error function, and every t_step time steps the target value network weight is synchronized as θ⁻ = θ; if the agent model meets the end condition, the simulation ends, and the final result is calculated and output; if the agent model does not meet the end condition, the process returns to the step in which the agent model declares bids.
The generator bidding behavior simulation system 200 based on the deep Q network provided by the embodiment of the invention comprises: a construction unit 201, an agent model processing unit 202, a declaration bidding unit 203 and a training unit 204. The construction unit 201 is configured to construct a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit. The agent model processing unit 202 sets the agent model parameters and initializes the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network. The declaration bidding unit 203 is configured for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids. The training unit 204 is configured to train and synchronize the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state. The system can solve the problem that traditional RL algorithms have difficulty effectively simulating the actual bidding behavior of generators, and improves the accuracy and rationality of the agent model's bidding behavior.
Third embodiment of the invention:
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the deep Q network-based generator bidding behavior simulation method according to any one of claims 1 to 5.
The above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A generator bidding behavior simulation method based on a deep Q network is characterized by comprising the following steps:
constructing a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
setting agent model parameters and initializing the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the agent model declares bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the agent model trains and synchronizes the value network until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
2. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit;

each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order, A_min and A_max being the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
3. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
4. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the agent model declares bids and the market operating mechanism performs market clearing calculation according to the declared bids, specifically: adopting an ε-greedy exploration mode, i.e., randomly selecting action a_t with probability ε, and otherwise selecting action a_t = argmax_a Q(φ_t, a | θ); after action a_t is determined, the corresponding quotation strategy is calculated according to the formula C_B = C_M·A_i and reported to the market operating mechanism; based on the quotation information, market load, power grid topology and market rules, the market operating mechanism takes minimizing the power generation cost (under one-sided quotation) or maximizing social welfare (under two-sided quotation) as the clearing objective, calculates the optimal power flow, and gives the related market clearing information.
5. The deep Q network-based generator bidding behavior simulation method according to claim 1, wherein the agent model trains and synchronizes the value network, specifically: according to the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; updating the current value network weight parameter θ by gradient descent according to the error function, and synchronizing the target value network weight θ⁻ = θ every t_step time steps; if the agent model meets the end condition, ending the simulation, and calculating and outputting the final result; if the agent model does not meet the end condition, returning to the step in which the agent model declares bids.
6. A generator bidding behavior simulation system based on a deep Q network, characterized by comprising: a construction unit, an agent model processing unit, a declaration bidding unit and a training unit; wherein:
the construction unit is used for constructing a state space S, an action space A and a reward function; the state space S selects the nodal electricity price at each time period, the highest winning-bid segment at each time period, and the congestion condition of tie lines at each time period as state features; the action space A is constructed based on a marginal cost curve; the reward function is obtained from the power generation profit;
the agent model processing unit sets the agent model parameters and performs initialization processing on the agent model; wherein the parameters include: the action space parameters A_min, A_max and H; the state space dimension; the exploration probability ε; the number of layers, number of neurons per layer, activation functions and optimizer parameters of the current value network and the target value network; the replay memory unit capacity; the maximum number of learning iterations; and the synchronization frequency t_step of the current value network and the target value network;
the declaration bidding unit is used for the agent model to declare bids, and the market operating mechanism performs market clearing calculation according to the declared bids;
the training unit is used for training and synchronizing the value network of the agent model until an end condition is met; wherein the end condition includes: the maximum number of learning iterations is reached or the market has reached an equilibrium state.
7. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein the marginal cost calculation formula is as follows:

C_M(P) = a + 2bP

where a and b are the coefficients of the linear and quadratic terms of the cost function, respectively, and P is the output of the generating unit;

each action multiplies the marginal cost by a coefficient A ∈ [A_min, A_max], with the interval divided into H equal parts in increasing order, A_min and A_max being the minimum and maximum selectable coefficients, respectively; if the agent model selects the i-th action, the corresponding coefficient is:

A_i = A_min + (i/H)·(A_max − A_min)

and its quoted price is then:

C_B = C_M·A_i.
8. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein initializing the agent model specifically comprises: initializing the market environment state sequence as s_1 according to the selected state features in the state space, and obtaining φ_1 = φ(s_1) after max-min normalization preprocessing; initializing the current value network weight parameter θ and setting the target value network weight parameter θ⁻ = θ.
9. The deep Q network-based generator bidding behavior simulation system according to claim 6, wherein the agent model trains and synchronizes the value network, specifically: according to the reward r_t and the next market environment state sequence s_{t+1}, obtaining φ_{t+1} = φ(s_{t+1}) by max-min normalization, and storing the transition sequence (φ_t, a_t, r_t, φ_{t+1}) in the replay memory unit; the agent model randomly samples a fixed number of transition samples (φ_j, a_j, r_j, φ_{j+1}) from the memory unit, calculates the optimization target Y_j = r_j + γ·max_a′ Q(φ_{j+1}, a′ | θ⁻) from the target network, and calculates the error function (Y_j − Q(φ_j, a_j | θ))²; updating the current value network weight parameter θ by gradient descent according to the error function, and synchronizing the target value network weight θ⁻ = θ every t_step time steps; if the agent model meets the end condition, ending the simulation, and calculating and outputting the final result; if the agent model does not meet the end condition, returning to the step in which the agent model declares bids.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the deep Q network-based generator bid behavior simulation method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836213.3A CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836213.3A CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112036936A true CN112036936A (en) | 2020-12-04 |
Family
ID=73576884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836213.3A Pending CN112036936A (en) | 2020-08-19 | 2020-08-19 | Deep Q network-based generator bidding behavior simulation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112036936A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI779732B (en) * | 2021-07-21 | 2022-10-01 | 國立清華大學 | Method for renewable energy bidding using multiagent transfer reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tian et al. | Data driven parallel prediction of building energy consumption using generative adversarial nets | |
Kumar et al. | A hybrid multi-agent based particle swarm optimization algorithm for economic power dispatch | |
McCabe et al. | Optimizing the shape of a surge-and-pitch wave energy collector using a genetic algorithm | |
Mocanu et al. | Unsupervised energy prediction in a Smart Grid context using reinforcement cross-building transfer learning | |
Wang et al. | An evolutionary game approach to analyzing bidding strategies in electricity markets with elastic demand | |
CN108962238A (en) | Dialogue method, system, equipment and storage medium based on structural neural networks | |
Garg et al. | Symbolic network: generalized neural policies for relational MDPs | |
Ciomek et al. | Heuristics for prioritizing pair-wise elicitation questions with additive multi-attribute value models | |
CN113132232B (en) | Energy route optimization method | |
Li et al. | A hybrid deep interval prediction model for wind speed forecasting | |
CN116207739B (en) | Optimal scheduling method and device for power distribution network, computer equipment and storage medium | |
CN113255890A (en) | Reinforced learning intelligent agent training method based on PPO algorithm | |
CN116345578A (en) | Micro-grid operation optimization scheduling method based on depth deterministic strategy gradient | |
CN112036936A (en) | Deep Q network-based generator bidding behavior simulation method and system | |
Goh et al. | Hybrid SDS and WPT-IBBO-DNM Based Model for Ultra-short Term Photovoltaic Prediction | |
CN114239675A (en) | Knowledge graph complementing method for fusing multi-mode content | |
CN112580868A (en) | Power system transmission blocking management method, system, equipment and storage medium | |
CN115329985B (en) | Unmanned cluster intelligent model training method and device and electronic equipment | |
CN116992151A (en) | Online course recommendation method based on double-tower graph convolution neural network | |
Correia | Games with incomplete and asymmetric information in poolco markets | |
CN115577647A (en) | Power grid fault type identification method and intelligent agent construction method | |
Yang et al. | GNP-Sarsa with subroutines for trading rules on stock markets | |
Mustafa et al. | An application of genetic algorithm and least squares support vector machine for tracing the transmission loss in deregulated power system | |
Mustafa et al. | Transmission loss allocation in deregulated power system using the hybrid genetic algorithm-support vector machine technique | |
Di Camillo et al. | SimBioNeT: a simulator of biological network topology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20201204 |