CN110276698A

CN110276698A - Distribution type renewable energy trade decision method based on the study of multiple agent bilayer cooperative reinforcing

Info

Publication number: CN110276698A
Application number: CN201910519858.1A
Authority: CN
Inventors: 王建春; 陈张宇; 刘�东; 黄玉辉; 孙健; 李峰; 殷小荣; 吉兰芳; 孙宏斌; 戴晖; 吴晓飞; 芦苇; 戴易见; 徐晓春; 李佑伟; 汤同峰
Original assignee: Shanghai Jiaotong University; HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Shanghai Jiaotong University; HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2019-06-17
Filing date: 2019-06-17
Publication date: 2019-09-24
Anticipated expiration: 2039-06-17
Also published as: CN110276698B

Abstract

The invention discloses a kind of distribution type renewable energy trade decision methods based on the study of multiple agent bilayer cooperative reinforcing, and the method includes following key steps: 1) constructing the double-deck Stochastic Decision-making Optimized model of distribution type renewable energy transaction；2) multiple agent bilayer cooperative reinforcing learning algorithm is introduced, according to the theoretical frame of multiple agent bilayer cooperative reinforcing learning algorithm, learning training is carried out, establishes function approximator and cooperative reinforcing study and work mechanism；3) estimated value of optimal Q value function is sought using iterative calculation method in the step 2) frame foundation；4) it using the multiple agent bilayer cooperative reinforcing learning algorithm solving optimization model trained, completes optimization and calculates.The present invention considers the uncertainty in distribution type renewable energy transaction, can be in the promotion Power Generation income for taking into account risk, while but also comprehensive benefit maximizes.

Description

Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning

Technical Field

The invention relates to the field of intelligent power distribution networks, in particular to a distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning.

Background

With the progress and development of society, the global demand for green, clean and efficient power is larger and larger, more and more distributed renewable energy sources are connected to a power distribution network, and the distributed energy sources have the characteristics of reasonable energy efficiency utilization, small loss, less pollution, flexible operation, good system economy and the like. The development mainly has the problems of grid connection, power supply quality, capacity storage, fuel supply and the like.

Distributed photovoltaic and wind power generation, while free of fuel costs, are high in construction, operating and maintenance costs. At present, the new energy distributed generator in China is mainly subsidized for profit through the electricity price of the state and local governments. However, as distributed power penetration increases, the profitability model is significantly less consistent with market laws. The distributed generators are subsidized through the subscription fee of the users, the generators can be helped to participate in market competition, and reasonable quotation is carried out according to the potential benefits and the power generation cost of the generators, so that the social benefits are improved to the maximum extent. Meanwhile, various uncertain information such as power generator quotation, distributed power supply output fluctuation, user subscription and the like are considered, model solution can be carried out through a multi-agent double-layer collaborative reinforcement learning solution method, an optimal scheduling decision can be rapidly calculated, risks are reduced, and economic benefits are improved.

Disclosure of Invention

In order to overcome the defects of the existing transaction decision method, the invention provides a distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning, a distributed energy double-layer random planning model under various uncertain information such as power generator quotation, distributed power supply output fluctuation, user subscription and the like is considered, model solution is carried out through a multi-agent double-layer collaborative reinforcement learning solution method, the optimal scheduling decision can be rapidly calculated, the risk is reduced, and the economic benefit is improved.

The invention realizes the aim through the following technical scheme:

a distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning comprises the following steps:

step 1) constructing a double-layer random decision optimization model of distributed renewable energy trading;

step 2) introducing a multi-agent double-layer collaborative reinforcement learning algorithm, carrying out learning training according to a theoretical framework of the multi-agent double-layer collaborative reinforcement learning algorithm, and establishing a function approximator and a collaborative reinforcement learning working mechanism; the function approximator estimates a Q value by adopting a series of adjustable parameters and characteristics extracted from a state action space, the approximator establishes mapping of the state action of a function from the parameter space to the Q value to the space, the mapping can be linear or nonlinear, solvability can be analyzed by utilizing linear mapping, and the typical form of the function approximator is as follows:

whereinIs an adjustable approximate parameter vector and is characterized in that,is the feature vector of the state-action pair,is the Basis Function (BF) (. DEG)^TRepresenting a matrix transpose operation;

step 3) solving an estimation value of an optimal Q value function by using an iterative calculation method on the basis of the frame in the step 2);

and 4) solving an optimization model by using the trained multi-agent double-layer collaborative reinforcement learning algorithm to complete optimization calculation.

Preferably, the double-layer random decision optimization model for the distributed renewable energy transaction in step 1) includes an upper-layer planning modeling and a lower-layer planning modeling, which respectively correspond to two parts of an energy transaction link.

Preferably, the opportunity constraint programming for maximizing the optimistic value of the objective function constructed in the upper-level programming modeling has the optimization goal of maximum economic benefit, the constraint condition is composed of objective constraint limits and opportunity constraint limits, and the mathematical expression of the upper-level programming modeling is as follows:

constraint function:

wherein λ -power generation trade time-sharing quotation, wherein λ^tIs the bid at time t, ξ -a random variable that is not known to be caused by the bidder's bid,random variables caused by uncertainty of deviation of real values and predicted values of wind power and photovoltaic,when the quote is lambda, atξ andgenerator revenue in the scenario, β — bearing risk confidence,-meeting expected yield at β confidence, q^t,ξξ, the power generator obtained in the lower layer planning marks the power amount in the time period t,under the scene of- ξ, user new energy subscription compensation (lower layer decision output) of unit electric quantity obtained by lower layer decision, c_base-the cost per unit of electricity generation,the power generator is ξPenalty fines in the scene, gamma-unit fines for outstanding electricity,time t, ξ scenario with exceeding of the amount of electricityThe unbalanced electric quantity of the maximum output under the scene,-atAnd in the scene, at the moment T, the actual output upper limit of the distributed power supply is T-one time period, and the default value is one hour.

Preferably, the lower-layer planning modeling is used for optimizing scheduling and allocating bid amount in each power generator aiming at a bidding scenario and with market operation comprehensive benefits as a target, and the mathematical expression of the lower-layer planning is as follows:

constraint function:

in the formula: n is a radical of_pv、N_wp-the total number of photovoltaic and wind power generators in the area, L-the total number of power consumers in the area,-unit cost of purchasing electricity from the external grid at time t,the electricity purchasing cost from No. i photovoltaic and wind power generators at the moment t,-t time point purchasing electric power from the external grid,purchasing electricity from No. i photovoltaic and wind power generators at the moment t,-load of No. i electricity consumer at time t, comp_pv、comp_wp-per-degree electricity subscription compensation paid in renewable energy sources such as photovoltaic, wind power and the like within the user subscription range, Q_load-pv-i、Q_load-wp-iPhotovoltaic and wind power generation system for I number user to settle accounts on current daySubscription of electric quantity, Q_pv、Q_wp、Q_grid-photovoltaic, wind-electric, external electric quantity, upsilon, consumed in the area of the day_pv、υ_wp-ratio of photovoltaic to wind power generation in the area of the day, α_i、β_i-the photovoltaic and wind power ratio subscribed by the i-th user,and (4) the maximum generating capacity at the time t reported by the No. i photovoltaic and wind power generator.

Preferably, in the step 2), a plurality of agents are utilized to respectively process the randomness problem of the upper-layer planning modeling and the lower-layer planning modeling and the mutual iteration of the upper-layer planning modeling and the lower-layer planning modeling; the double-layer collaborative reinforcement algorithm introduces a diffusion strategy in the reinforcement learning process, and introduces an adaptive combination (ATC) mechanism into the reinforcement learning algorithm, and the collaborative reinforcement learning algorithm can adapt to randomness and uncertainty caused by distributed renewable energy sources and can adapt to the problem of complex calculation of a double-layer random decision optimization model; in addition, to avoid the storage of a large number of Q value tables, a function approximator is used to record the Q values of complex continuous states and motion spaces.

Preferably, the diffusion strategy can achieve faster convergence and can achieve a lower mean square deviation than a uniform strategy, which is as follows:

whereinIs an intermediate term, x, introduced by the diffusion strategy_i(k +1) is the state updated by combining all intermediate terms of agent i; n is a radical of_iIs a set of points adjacent to agent i; b_ijIs the weight assigned by agent i to neighboring agent j; defining a matrix B ═ B_ij]∈R^n×nAs a topological matrix of the microgrid communication network; the topology matrix B is a random matrix, B1_n＝1_nIn which 1 is_n∈RⁿIs a unit vector.

Has the advantages that:

1. the double-layer decision optimization model established by the invention can comprehensively consider the uncertainty situation caused by the random variables and make better decisions. It is therefore well suited for optimization decisions for distributed generators.

2. The algorithm provided by the invention is a double-layer collaborative reinforcement learning algorithm, can be well integrated into a two-layer random decision optimization model, and provides a new idea for intensive energy trading decision of a future information network and an energy network.

3. The invention introduces a plurality of agents to respectively process the randomness problem of the upper and lower layers of planning and the mutual iteration of the upper and lower layers, so that the collaborative reinforcement learning algorithm is more suitable for the problem of the double-layer planning.

4. The multi-agent double-layer collaborative reinforcement learning is used as a multi-agent reinforcement learning algorithm with self-learning and collaborative learning capabilities, and is more suitable for solving the large-scale distributed access energy problem with strong randomness and uncertainty. After certain training and updating, the algorithm can quickly carry out dynamic optimization, and meanwhile, the stability of global convergence is guaranteed.

5. A diffusion strategy is introduced in the reinforcement learning process, so that distributed information exchange can be realized in the microgrid, the calculation cost is reduced, faster convergence can be realized, and the mean square deviation lower than that of a consistent strategy can be achieved.

Drawings

FIG. 1 is an overall frame diagram of the present invention;

FIG. 2 is a flow chart of multi-agent dual-tier collaborative reinforcement learning according to the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings and specific embodiments.

The distributed renewable energy trading decision method based on multi-agent double-layer collaborative reinforcement learning disclosed by the invention takes a power distribution network as a medium, simultaneously schedules a distributed power supply and a controllable load, and realizes economic benefit optimization, and the optimization object and model of the method are schematically shown in figure 1.

The invention provides a distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning, which comprises the following steps:

step 2) introducing a multi-agent double-layer collaborative reinforcement learning algorithm, carrying out learning training according to a theoretical framework of the multi-agent double-layer collaborative reinforcement learning algorithm, and establishing a function approximator and a collaborative reinforcement learning working mechanism;

The double-layer random decision optimization model for the distributed renewable energy transaction in the step 1) comprises an upper-layer planning modeling and a lower-layer planning modeling, which respectively correspond to two parts of an energy transaction link

In the step 2), a plurality of agents are used for respectively processing the randomness problem of the upper-layer planning modeling and the lower-layer planning modeling and the mutual iteration of the upper-layer planning modeling and the lower-layer planning modeling; the double-layer collaborative reinforcement algorithm introduces a diffusion strategy in the reinforcement learning process, and introduces an adaptive combination (ATC) mechanism into the reinforcement learning algorithm, and the collaborative reinforcement learning algorithm can adapt to randomness and uncertainty caused by distributed renewable energy sources and can adapt to the problem of complex calculation of a double-layer random decision optimization model; to avoid the storage of a large number of Q-value tables, a function approximator is used to record the Q-values of complex continuous states and motion spaces.

The step 3) iterative computation flow comprises the following steps (see fig. 2):

s1 initialization theta₀，ω₀

S2, repeating the times k is 1to T

S3, each agent calculates i-1 to n in turn

S4 calculating the feature vectorAnd state s_i(k)

S5 selecting action a according to strategy pi_i(k)

S6 observing the prize value r_i(k)

TD error delta S7_i(k)

S8 estimation

S9, updating the parameter theta_i(k)，ω_i(k)

S10 Return to S3

S11 Return to S2

S12: and returning the result.

The basic steps and explanation of the application of the distributed renewable energy framework of multi-agent double-layer collaborative reinforcement learning are as follows:

a1: decomposing and writing the target function and the constraint function of the upper and lower layer plans into respective rewards of a reinforcement learning algorithm to serve as reference values of rewards, wherein the target function of the upper layer plan is expected to be the maximum and is set as forward rewards, the target function of the lower layer plan is expected to be the lowest in price and is set as reverse rewards, the constraint conditions of the upper and lower layer plans are used as penalty items, coefficients are set according to actual debugging conditions, the requirement is that the penalty coefficient of the strong constraint is far greater than the Reward item coefficient, and the weak constraint is greater than the Reward item coefficient.

A2: the method comprises the steps of constructing a first reinforcement learning module which is essentially a combination of two (usually a plurality of) reinforcement learning intelligent agents, establishing a reinforcement learning intelligent agent unit by taking a lower-layer plan as a module, establishing a reinforcement learning intelligent agent unit by taking each power generator as a module at an upper layer due to a plurality of power generators, and finally integrating the intelligent agent unit at the upper layer and the intelligent agent unit at the lower layer through a whole intelligent agent unit, wherein as shown in an intelligent agent II in figure 1, the Reward structure of the intelligent agent II is that the maximum total Reward of each intelligent agent unit is the maximum target.

A3: and establishing a function approximator. The storage of the Q value occupies a large amount of resources of the computer, so as to reduce the occupation of the computer resources and increase the calculation speed.

A4: establishing a cooperative reinforcement learning working mechanism, and establishing a parameter updating process of integrating an adaptive combination (ATC) diffusion strategy into Greedy-GQ in order to accelerate the calculation efficiency of a multi-agent.

A5: and constructing a second reinforcement learning module, taking the agent II as an environment of the agent, and establishing an updating strategy by using a conventional Q learning (or Sarsar, DQN and the like) updating rule.

Modeling upper layer planning:

and the opportunity constraint planning of the optimistic value of the maximized objective function constructed in the upper-layer planning aims at maximizing economic benefit, and the constraint condition consists of objective constraint limit and opportunity constraint limit. Moreover, the upper layer optimization aims at an optimistic value of the economic benefit (i.e. the economic benefit obtained is better than the value at a certain confidence) to minimize the operation cost of the distribution network. The objective constraint limits are constraint conditions aiming at the deterministic objects and comprise unit power generation cost, unit unfinished power generation amount fine, upper and lower limits of actual processing of the distributed power supply and the like. The opportunity constraint limit is a constraint condition aiming at the distribution network uncertainty object, and comprises a probability constraint for bearing risk execution degree, a power flow safety limit and the like. Sources of uncertainty factors include uncertainty of distributed photovoltaic, wind power output, generator bid, uncertainty of traditional load forecast deviation, and the like.

Therefore, the mathematical expression of the upper-level planning modeling is as follows:

constraint function:

in the formula

lambda-Power Generation trade time-of-sale quotes, where lambda^tIs a quote at time t

ξ random variable caused by unknown price quote of bidder

-random variables caused by uncertainty of deviation of real values and predicted values of wind power and photovoltaic

When the offer is λ, ξ andpower generator revenue under scene

β -confidence of bearing risk

Satisfaction of expected yield at β confidence

q^t,ξξ scenario, the power generator obtained in the lower layer planning bid amount in the time period t

ξ scene, unit electric quantity user new energy subscription compensation obtained by lower layer decision (lower layer decision output)

c_base-cost per unit of electricity generation

-power generators at ξ andpenalty fines under scene

Gamma-unit fine of unfinished electricity

Time t, ξ scenario with power charge exceededUnbalanced electric quantity of maximum output under scene

-atActual output upper limit of distributed power supply at time t under scene

T-a period of time, with a default value of one hour.

Modeling of a lower layer plan:

and the lower-layer planning optimizes the scheduling and the allocation of the right to bid of each power plant by taking the comprehensive benefits of market operation as a target. The lower level programming model is actually a market-balanced scheduling model for the regional retail market. The accuracy of the model determines whether the regional market can function properly according to the rules. Due to the neglect of energy storage, the electricity purchase sources in the region include both the distributed generator and the external grid, and the sum of the electricity purchase cost of each period constitutes the cost source of the system. In addition, considering that the user is willing to pay a certain cost to order new energy and enjoy green power, this user group may also be included in the overall benefit. Therefore, the optimal goal can be to minimize the cost of electricity purchase and increase the subscription fees for green electricity.

Therefore, the mathematical expression of the underlying plan modeling is as follows:

constraint function:

in the formula:

N_pv、N_wp-total number of photovoltaic and wind power generation suppliers in the area

L-total number of power consumers in area

-unit cost of purchasing electricity from external grid at time t

-the electricity purchasing cost from No. i photovoltaic and wind power generator at the moment t

-purchasing electric power from external grid at time t

-purchasing electricity from No. i photovoltaic and wind power generator at time t

Load capacity of No. i power consumer at time t

comp_pv、comp_wpThe subscription range of the user is per-degree electricity subscription compensation paid in renewable energy sources such as photovoltaic energy, wind power energy and the like

Q_load-pv-i、Q_load-wp-iThe photovoltaic and wind power subscription electric quantity which is paid by the i-number user in the current day settlement

Q_pv、Q_wp、Q_grid-photovoltaic, wind power, external electrical quantities consumed within the area of the day

υ_pv、υ_wp-ratio of photovoltaic to wind power generation in the region of the same day

α_i、β_i-photovoltaic and wind power ratio subscribed by No. i user

-maximum generated energy at time t reported by No. i photovoltaic and wind power generator

A function approximator:

the function approximator estimates the Q value using a series of adjustable parameters and features extracted from the state action space. The approximator then builds a mapping of the state contribution from the parameter space to the Q-value function to the space. The mapping may be linear or non-linear. Solvability may be analyzed using linear mapping. A typical form of a linear approximator is as follows:

whereinIs an adjustable approximate parameter vector and is characterized in that,is the feature vector of the state-action pair, which can be derived from the following equation:

whereinIs a Basis Function (BF), such as gaussian radial BF, centered at a selected motionless point in the state space. Typically, the BFs sets corresponding to fixed points are evenly distributed in the state space. Herein, all vectors are considered to be column vectors if not specified. (.)^TRepresenting a matrix transpose operation. Radial basis function neural networks have been used in random nonlinear interconnect systems and have been shown to have good generalization performance.

Diffusion strategy:

the reinforcement learning algorithm introduces a diffusion strategy in the reinforcement learning process, and introduces an adaptive combination (ATC) mechanism into the reinforcement learning algorithm. The diffusion strategy may achieve faster convergence and may achieve a lower mean square deviation than the uniform strategy. Furthermore, the flooding strategy has better response performance to continuous real-time signals and is insensitive to neighboring weights. The basic idea of the flooding strategy is to combine collaboration items based on neighboring states during the self-state update of each agent. Consider having state x_iAnd its dynamic characteristics.

x_i(k+1)＝x_i(k)+f(x_i(k))

The diffusion strategy is as follows:

whereinIs an intermediate term, x, introduced by the diffusion strategy_i(k +1) is the state updated by combining all the intermediate terms of agent i. N is a radical of_iIs a set of points adjacent to agent i. In addition, b_ijIs the weight assigned by agent i to neighboring agent j. Here, we can define a matrix B ═ B_ij]∈R^n×nAs a topology matrix of the microgrid communication network. In general, the topology matrix B is a random matrix, which means B1_n＝1_nIn which 1 is_n∈RⁿIs a unit vector.

A collaborative reinforcement learning algorithm is provided by integrating an adaptive combination (ATC) diffusion strategy into the parameter updating process of Greeny-GQ.

It is noted that the proposed cooperative reinforcement learning algorithm introduces two intermediate vectors:andactual approximation parameter vector θ_i(k +1) and correction parameter vector ω_iIn the proposed algorithm, the learning rate parameters α (k) and β (k) can be set with the conditions P (1) to P (4).

α(k)＞0,β(k)＞0 P(1)

α(k)/β(k)→0 P(4)

Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.

Claims

1. A distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning is characterized by comprising the following steps:

step 2) introducing a multi-agent double-layer collaborative reinforcement learning algorithm, carrying out learning training according to a theoretical framework of the multi-agent double-layer collaborative reinforcement learning algorithm, and establishing a function approximator; the function approximator estimates a Q value by adopting a series of adjustable parameters and characteristics extracted from a state action space, the approximator establishes mapping of the state action of a function from the parameter space to the Q value to the space, the mapping can be linear or nonlinear, solvability can be analyzed by utilizing linear mapping, and the typical form of the function approximator is as follows:

whereinIs an adjustable approximate parameter vector and is characterized in that,is the feature vector of the state-action pair,is the Basis Function (BF); (.)^TRepresenting a matrix transpose operation;

2. The distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning according to claim 1, characterized in that: the distributed renewable energy trading double-layer random decision optimization model in the step 1) comprises an upper-layer planning modeling and a lower-layer planning modeling, and the upper-layer planning modeling and the lower-layer planning modeling respectively correspond to two parts of an energy trading link.

3. The distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning according to claim 2, characterized in that: the opportunity constraint planning of the optimistic value of the maximized objective function constructed in the upper-layer planning modeling has the optimization target of the maximum economic benefit, the constraint condition consists of objective constraint limit and opportunity constraint limit, and the mathematical expression of the upper-layer planning modeling is as follows:

constraint function:

wherein lambda is a time-shared quote for the generator, wherein lambda^tIs the quote at time t, ξ -a random variable that is not known to the bidder,random variables caused by uncertainty of deviation of real values and predicted values of wind power and photovoltaic,when the quote is λ, ξ withGenerator revenue in the scenario, β -confidence in risk,-meeting expected revenue at β confidence, q^t,ξξ, the power generator obtained in the lower layer planning is marked with power in the time period t,under the scene of- ξ, the new energy subscription compensation of the unit electric quantity user obtained by lower layer decision, c_base-the cost per unit of electricity generation,the generator at ξ andpenalty fines in the scene, gamma-unit fines for outstanding electricity,time t, ξ scenario with power exceedingThe unbalanced electric quantity of the maximum output under the scene,-atUnder the scene, at the actual output upper limit of the distributed power supply at the moment T, T-one time period, the default value is one hour.

4. The distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning according to claim 2, characterized in that: the lower-layer planning modeling is used for optimizing scheduling and distributing bid amount of each power generator aiming at bidding scenes and taking market operation comprehensive benefits as targets, and the mathematical expression of the lower-layer planning is as follows:

constraint function:

in the formula: n is a radical of_pv、N_wp-the total number of photovoltaic, wind power generators in the area, L-the total number of power consumers in the area,-unit cost of purchasing electricity from the external grid at time t,the electricity purchasing cost from No. i photovoltaic and wind power generators at the moment t,-t time purchasing electric power from an external grid,purchasing electricity from No. i photovoltaic and wind power generation suppliers at the moment t,-load of No. i electricity consumer at time t, comp_pv、comp_wp-subscriber subscriptionThe range is per-degree electricity subscription compensation paid in renewable energy sources such as photovoltaic energy, wind power energy and the like, Q_load-pv-i、Q_load-wp-iI number user settles the charge-receivable photovoltaic and wind power subscription electric quantity on the same day, Q_pv、Q_wp、Q_grid-photovoltaic, wind-electric, external electric quantity, upsilon, consumed in the area of the day_pv、υ_wp-ratio of photovoltaic to wind power generation in the area of the day, α_i、β_i-the photovoltaic and wind power ratio subscribed by the i-th user,and (4) the maximum generating capacity at the time t reported by the No. i photovoltaic and wind power generator.

5. The distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning according to claim 1, characterized in that: in the step 2), a plurality of agents are used for respectively processing the randomness problem of the upper-layer planning modeling and the lower-layer planning modeling and the mutual iteration of the upper-layer planning modeling and the lower-layer planning modeling; the double-layer collaborative reinforcement learning algorithm introduces a diffusion strategy in the reinforcement learning process, and introduces a self-adaptive combined ATC mechanism into the reinforcement learning algorithm, and the double-layer collaborative reinforcement learning algorithm can adapt to randomness and uncertainty caused by distributed renewable energy sources and can adapt to the problem of complex calculation of a double-layer random decision optimization model; to avoid the storage of a large number of Q-value tables, a function approximator is used to record the Q-values of complex continuous states and motion spaces.

6. The distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning according to claim 5, characterized in that: the diffusion strategy can achieve faster convergence and can achieve lower mean square deviation than the uniform strategy, and the diffusion strategy is as follows:

wherein,is an intermediate term, x, introduced by the diffusion strategy_i(k +1) is the state updated by combining all intermediate terms of agent i; n is a radical of_iIs a set of points adjacent to agent i; b_ijIs the weight assigned by agent i to neighboring agent j; here, a matrix B ═ B is defined_ij]∈R^n×nAs a topological matrix of the microgrid communication network; the topology matrix B is a random matrix, B1_n＝1_nIn which 1 is_n∈RⁿIs a unit vector.