CN110276698B - Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning - Google Patents

Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning

Info

Publication number
CN110276698B
CN110276698B (application CN201910519858.1A)
Authority
CN
China
Prior art keywords
reinforcement learning
layer
renewable energy
power
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910519858.1A
Other languages
Chinese (zh)
Other versions
CN110276698A (en)
Inventor
王建春
陈张宇
刘东
黄玉辉
孙健
李峰
殷小荣
吉兰芳
孙宏斌
戴晖
吴晓飞
芦苇
戴易见
徐晓春
李佑伟
汤同峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Shanghai Jiao Tong University
Original Assignee
HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HuaiAn Power Supply Co of State Grid Jiangsu Electric Power Co Ltd and Shanghai Jiao Tong University
Priority to CN201910519858.1A
Publication of CN110276698A
Application granted
Publication of CN110276698B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Computer Hardware Design (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Supply And Distribution Of Alternating Current (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning, which comprises the following main steps: 1) constructing a two-layer stochastic decision optimization model of distributed renewable energy trading; 2) introducing a multi-agent double-layer collaborative reinforcement learning algorithm, carrying out learning and training according to the theoretical framework of the algorithm, and establishing a function approximator and a collaborative reinforcement learning working mechanism; 3) calculating an estimate of the optimal Q-value function with an iterative calculation method on the basis of the framework of step 2); 4) solving the optimization model with the trained multi-agent double-layer collaborative reinforcement learning algorithm to complete the optimization calculation. The invention accounts for the uncertainty in distributed renewable energy transactions, can improve the income of generators while considering risk, and simultaneously maximizes the comprehensive benefit.

Description

Distributed renewable energy trading decision method based on multi-agent double-layer collaborative reinforcement learning

Technical Field

The invention relates to the field of smart power distribution networks, and in particular to a distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning.

Background Art

With the progress and development of society, the global demand for green, clean, and efficient electric power keeps growing, and more and more distributed renewable energy sources are connected to the distribution network. Distributed energy features efficient energy utilization, low losses, little pollution, flexible operation, and good system economy, but its development still faces problems such as grid connection, power supply quality, capacity reserve, and fuel supply.

Although distributed photovoltaic and wind power generation have no fuel cost, their construction, operation, and maintenance costs are high. At present, China's distributed new-energy generators profit mainly through national and local government electricity price subsidies. However, as the penetration of distributed generation increases, this profit model clearly no longer conforms to market rules. Subsidizing distributed generators through user subscription fees can help generators participate in market competition and quote reasonably according to their own potential benefits and generation costs, thereby maximizing social benefit. Meanwhile, by considering multiple kinds of uncertain information such as generator quotations, distributed generation output fluctuations, and user subscriptions, the model can be solved with a multi-agent double-layer collaborative reinforcement learning method, which quickly computes the optimal dispatch decision, reduces risk, and improves economic benefit.

Summary of the Invention

To overcome the shortcomings of existing transaction decision-making methods, the invention proposes a distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning. It builds a two-layer stochastic programming model of distributed energy under multiple kinds of uncertain information, such as generator quotations, distributed generation output fluctuations, and user subscriptions, and solves the model with a multi-agent double-layer collaborative reinforcement learning method, which quickly computes the optimal dispatch decision, reduces risk, and improves economic benefit.

The invention achieves the above object through the following technical solution:

A distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning, comprising the following steps:

Step 1) Build a two-layer stochastic decision optimization model for distributed renewable energy trading;

Step 2) Introduce a multi-agent double-layer collaborative reinforcement learning algorithm; carry out learning and training according to the theoretical framework of the algorithm, and establish a function approximator and a collaborative reinforcement learning working mechanism. The function approximator uses a set of adjustable parameters and features extracted from the state-action space to estimate the Q value; the approximator establishes a mapping from the parameter space to the Q-value function over the state-action space. The mapping can be linear or nonlinear, and a linear mapping can be used to analyze solvability. The typical form of the function approximator is as follows:

Q(s, a; θ) = θTφ(s, a)

φ(s, a) = [ψ1(s, a), ψ2(s, a), …, ψN(s, a)]T

where θ is an adjustable approximation parameter vector, φ(s, a) is the feature vector of the state-action pair, ψ are the basis functions (BFs), and (·)T denotes matrix transposition;

Step 3) On the basis of the framework in step 2), use an iterative calculation method to obtain an estimate of the optimal Q-value function;

Step 4) Use the trained multi-agent double-layer collaborative reinforcement learning algorithm to solve the optimization model and complete the optimization calculation.

Preferably, the two-layer stochastic decision optimization model for distributed renewable energy trading in step 1) comprises upper-layer planning modeling and lower-layer planning modeling, corresponding respectively to the two parts of the energy trading process.

Preferably, the upper-layer planning modeling constructs a chance-constrained program that maximizes the optimistic value of the objective function; the optimization objective is maximal economic benefit, and the constraints consist of objective constraints and chance constraints. The mathematical expression of the upper-layer planning model is as follows:

[Objective function rendered as an image in the original publication.]

Constraint functions:

[Four constraint expressions rendered as images in the original publication.]

where:

λ — the generator's time-of-use quotation, λt being the quotation at time t;
ξ — a random variable caused by the unknown quotations of competing bidders;
(symbol rendered as image) — a random variable caused by the uncertain deviation between the actual and forecast wind and photovoltaic output;
(symbol rendered as image) — the generator's revenue under scenario ξ and the output-deviation scenario when the quotation is λ;
β — the confidence level of risk tolerance;
(symbol rendered as image) — the expected revenue attained at confidence β;
qt,ξ — under scenario ξ, the electricity this generator wins in period t, obtained from the lower-layer planning;
csξ — under scenario ξ, the per-unit-electricity new-energy subscription compensation obtained from the lower-layer decision (the lower-layer decision output);
cbase — unit generation cost;
(symbol rendered as image) — the generator's default penalty under scenario ξ and the output-deviation scenario;
γ — the unit penalty for uncompleted electricity;
(symbol rendered as image) — at time t, the unbalanced electricity by which the winning bid in scenario ξ exceeds the maximum output of the output-deviation scenario;
(symbol rendered as image) — the actual upper output limit of the distributed generation at time t in the output-deviation scenario;
T — one period, with a default value of one hour.

Preferably, the lower-layer planning modeling is used, for the bidding scenario, to optimize dispatch and allocate each generator's winning electricity with the comprehensive benefit of market operation as the objective. The mathematical expression of the lower-layer planning is as follows:

[Objective function rendered as an image in the original publication.]

Constraint functions:

[Eleven constraint expressions rendered as images in the original publication.]

where:

Npv, Nwp — the total numbers of photovoltaic and wind power producers in the region;
L — the total number of electricity users in the region;
(symbol rendered as image) — the unit cost of purchasing electricity from the external grid at time t;
(symbol rendered as image) — the cost of purchasing electricity from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the electricity purchased from the external grid at time t;
(symbol rendered as image) — the electricity purchased from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the load of electricity user i at time t;
comppv, compwp — the per-kWh subscription compensation paid by users whose subscription covers renewable energy such as photovoltaic and wind power;
Qload-pv-i, Qload-wp-i — the photovoltaic and wind subscription electricity payable by user i in the day's settlement;
Qpv, Qwp, Qgrid — the photovoltaic, wind, and external electricity consumed in the region that day;
υpv, υwp — the proportions of photovoltaic and wind generation in the region that day;
αi, βi — the proportions of photovoltaic and wind power subscribed by user i;
(symbol rendered as image) — the maximum generation at time t declared by photovoltaic/wind producer i.

Preferably, in step 2), multiple agents are used to handle, respectively, the randomness inherent in the upper-layer and lower-layer planning models and the mutual iteration between the two layers. The two-layer collaborative reinforcement algorithm introduces a diffusion strategy into the reinforcement learning process, bringing the adapt-then-combine (ATC) mechanism into the reinforcement learning algorithm; the collaborative reinforcement learning algorithm adapts to the randomness and uncertainty brought by distributed renewable energy and to the computational complexity of the two-layer stochastic decision optimization model. In addition, to avoid storing a large number of Q-value tables, a function approximator is used to record the Q values of the complex continuous state and action spaces.

Preferably, the diffusion strategy achieves faster convergence and a lower mean-square deviation than a consensus strategy; the diffusion strategy is as follows:

x̃i(k+1) = xi(k) + f(xi(k))

xi(k+1) = Σj∈Ni bij x̃j(k+1)

where x̃i(k+1) (a symbol rendered as an image in the original) is an intermediate term introduced by the diffusion strategy, and xi(k+1) is the state updated by combining all the intermediate terms available to agent i; Ni is the set of nodes adjacent to agent i; bij is the weight assigned by agent i to the neighboring agent j. A matrix B = [bij] ∈ Rn×n is defined as the topology matrix of the microgrid communication network; the topology matrix B is a stochastic matrix, B1n = 1n, where 1n ∈ Rn is the all-ones vector.

Beneficial effects:

1. The two-layer decision optimization model established by the invention can comprehensively consider the uncertain scenarios caused by these random variables and make better decisions; it is therefore well suited to the optimal decision-making of distributed generators.

2. The algorithm proposed in the invention is a two-layer collaborative reinforcement learning algorithm that can be integrated well into the two-layer stochastic decision optimization model, providing a new approach to intensive energy trading decisions for future information and energy networks.

3. The invention introduces multiple agents to handle, respectively, the randomness of the upper- and lower-layer planning and the iteration between the two layers, making the collaborative reinforcement learning algorithm better suited to two-layer planning problems.

4. Multi-agent double-layer collaborative reinforcement learning, as a multi-agent reinforcement learning algorithm with self-learning and collaborative learning capabilities, is better suited to large-scale distributed-access energy problems with strong randomness and uncertainty. After a certain amount of training updates, the algorithm can perform dynamic optimization quickly while keeping global convergence stable.

5. A diffusion strategy is introduced into the reinforcement learning process, enabling distributed information exchange in the microgrid while reducing computational cost; it achieves faster convergence and a lower mean-square deviation than a consensus strategy.

Brief Description of the Drawings

Fig. 1 is the overall framework diagram of the invention;

Fig. 2 is the flowchart of the multi-agent double-layer collaborative reinforcement learning of the invention.

Detailed Description of the Embodiments

The invention is further described below with reference to the accompanying drawings and specific embodiments.

The distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning uses the distribution network as the medium and dispatches distributed generation and controllable loads simultaneously to optimize economic benefit; the optimization objects and model are shown schematically in Fig. 1.

The invention proposes a distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning, the method comprising the following steps:

Step 1) Build a two-layer stochastic decision optimization model for distributed renewable energy trading;

Step 2) Introduce a multi-agent double-layer collaborative reinforcement learning algorithm; carry out learning and training according to the theoretical framework of the algorithm, and establish a function approximator and the collaborative reinforcement learning working mechanism;

Step 3) On the basis of the framework in step 2), use an iterative calculation method to obtain an estimate of the optimal Q-value function;

Step 4) Use the trained multi-agent double-layer collaborative reinforcement learning algorithm to solve the optimization model and complete the optimization calculation.

The two-layer stochastic decision optimization model for distributed renewable energy trading in step 1) comprises upper-layer planning modeling and lower-layer planning modeling, corresponding respectively to the two parts of the energy trading process.

In step 2), multiple agents are used to handle, respectively, the randomness inherent in the upper-layer and lower-layer planning models and the mutual iteration between the two layers. The two-layer collaborative reinforcement algorithm introduces a diffusion strategy into the reinforcement learning process, bringing the adapt-then-combine (ATC) mechanism into the reinforcement learning algorithm; the collaborative reinforcement learning algorithm adapts to the randomness and uncertainty brought by distributed renewable energy and to the computational complexity of the two-layer stochastic decision optimization model. To avoid storing a large number of Q-value tables, a function approximator is used to record the Q values of the complex continuous state and action spaces.

The iterative calculation process of step 3) comprises the following steps (see Fig. 2):

S1: Initialize θ0, ω0

S2: Repeat for k = 1 to T

S3: Each agent computes in turn, i = 1 to n

S4: Compute the feature vector and the state si(k)

S5: Select the action ai(k) according to policy π

S6: Observe the reward value ri(k)

S7: Compute the TD error δi(k)

S8: Estimate the approximate Q value (expression rendered as an image in the original)

S9: Update the parameters θi(k), ωi(k)

S10: Return to S3

S11: Return to S2

S12: Return the result.
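For concreteness, the loop S1-S12 can be sketched in Python; this is a minimal illustrative skeleton under assumed toy dynamics, in which the environment, feature map, policy, and reward are stand-ins for the quantities named in S4-S8 rather than the patent's actual market model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_feat, T, gamma = 3, 8, 200, 0.95   # illustrative sizes
theta = np.zeros((n_agents, n_feat))           # S1: approximation parameters
omega = np.zeros((n_agents, n_feat))           # S1: correction parameters
centers = np.linspace(-1.0, 1.0, n_feat)       # fixed RBF centers

def phi(state, action):
    """S4: toy Gaussian-RBF feature vector of a state-action pair."""
    z = state.mean() + 0.1 * action
    return np.exp(-(z - centers) ** 2)

for k in range(T):                             # S2: repeat k = 1..T
    for i in range(n_agents):                  # S3: each agent in turn
        s = rng.normal(size=2)                 # stand-in for state s_i(k)
        a = int(rng.integers(0, 2))            # S5: action from policy pi
        f = phi(s, a)                          # S4: feature vector
        r = -np.abs(s).sum()                   # S6: observed reward r_i(k)
        s2 = rng.normal(size=2)                # successor state
        f2 = phi(s2, int(rng.integers(0, 2)))  # successor features
        delta = r + gamma * theta[i] @ f2 - theta[i] @ f   # S7: TD error
        alpha, beta = 1 / (k + 1) ** 0.9, 1 / (k + 1) ** 0.6
        # S8-S9: Greedy-GQ style estimate and parameter update
        theta[i] += alpha * (delta * f - gamma * (omega[i] @ f) * f2)
        omega[i] += beta * (delta - f @ omega[i]) * f
```

In the full algorithm the S9 update is followed by the ATC combination over neighboring agents, which is sketched in the diffusion strategy section further below.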

The basic steps of applying the multi-agent double-layer collaborative reinforcement learning framework to distributed renewable energy are explained as follows:

A1: Decompose the objective functions and constraint functions of the upper- and lower-layer planning into the respective Rewards of the reinforcement learning algorithm as reward reference values. The upper-layer objective function should attain the maximum expectation, so it is set as a positive reward; the lower-layer objective function seeks the lowest price, so it is set as a reverse reward. The constraints of both layers serve as penalty terms whose coefficients are set according to actual tuning: the penalty coefficients of strong constraints must be far larger than the reward coefficients, while those of weak constraints need only exceed the reward coefficients, as sketched below.
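As an illustration of A1, the reward composition might be coded as follows; the coefficient values and function names are assumptions for the sketch, not values from the patent.

```python
def upper_reward(profit, strong_violations, weak_violations,
                 k_reward=1.0, k_strong=100.0, k_weak=2.0):
    """Upper layer: the objective (expected profit) enters as a positive
    reward; constraint violations enter as penalties, with the strong-
    constraint coefficient far larger than the reward coefficient."""
    return (k_reward * profit
            - k_strong * sum(strong_violations)
            - k_weak * sum(weak_violations))

def lower_reward(purchase_cost, strong_violations, weak_violations,
                 k_reward=1.0, k_strong=100.0, k_weak=2.0):
    """Lower layer: the objective seeks the lowest price, so it enters
    as a reverse (negative) reward, with the same penalty structure."""
    return (-k_reward * purchase_cost
            - k_strong * sum(strong_violations)
            - k_weak * sum(weak_violations))
```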

A2: Build the first reinforcement learning module, which is in essence a combination of two (usually more) reinforcement learning agents. The lower-layer planning forms one reinforcement learning agent unit as a module; in the upper layer, since there are multiple generators, each generator forms a reinforcement learning agent unit as a module. Finally, the upper-layer agent units and the lower-layer agent unit are integrated through one overall agent unit, shown as Agent II in Fig. 1; the Reward of Agent II is constructed to maximize the total reward of all the agent units.

A3: Build the function approximator. Storing Q values would occupy a large amount of computing resources; the approximator reduces this occupation while speeding up computation.

A4: Establish the collaborative reinforcement learning working mechanism. To improve the computational efficiency among the multiple agents, an adapt-then-combine (ATC) diffusion strategy is integrated into the parameter update process of Greedy-GQ.

A5: Build the second reinforcement learning module, taking Agent II as the agent's environment and establishing the update policy with conventional Q-learning (or Sarsa, DQN, etc.) update rules.

Upper-layer planning modeling:

The upper-layer planning constructs a chance-constrained program that maximizes the optimistic value of the objective function; the optimization objective is maximal economic benefit, and the constraints consist of objective constraints and chance constraints. Moreover, the upper-layer optimization takes the optimistic value of economic benefit as the target (i.e., with a given confidence level the obtained benefit exceeds this value), minimizing the distribution network's operating cost. The objective constraints are conditions on deterministic objects, including the unit generation cost, the penalty per unit of uncompleted generation, and the upper and lower limits of actual distributed generation output. The chance constraints are conditions on the uncertain objects of the distribution network, including probability constraints on risk tolerance and power flow security limits. The sources of uncertainty include distributed photovoltaic and wind power output, the uncertainty of generators' bids, and the deviation of conventional load forecasts.

Therefore, the mathematical expression of the upper-layer planning model is as follows:

[Objective function rendered as an image in the original publication.]

Constraint functions:

[Four constraint expressions rendered as images in the original publication.]

where:

λ — the generator's time-of-use quotation, λt being the quotation at time t;
ξ — a random variable caused by the unknown quotations of competing bidders;
(symbol rendered as image) — a random variable caused by the uncertain deviation between the actual and forecast wind and photovoltaic output;
(symbol rendered as image) — the generator's revenue under scenario ξ and the output-deviation scenario when the quotation is λ;
β — the confidence level of risk tolerance;
(symbol rendered as image) — the expected revenue attained at confidence β;
qt,ξ — under scenario ξ, the electricity this generator wins in period t, obtained from the lower-layer planning;
csξ — under scenario ξ, the per-unit-electricity new-energy subscription compensation obtained from the lower-layer decision (the lower-layer decision output);
cbase — unit generation cost;
(symbol rendered as image) — the generator's default penalty under scenario ξ and the output-deviation scenario;
γ — the unit penalty for uncompleted electricity;
(symbol rendered as image) — at time t, the unbalanced electricity by which the winning bid in scenario ξ exceeds the maximum output of the output-deviation scenario;
(symbol rendered as image) — the actual upper output limit of the distributed generation at time t in the output-deviation scenario;
T — one period, with a default value of one hour.
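Since the chance constraint bounds the probability that revenue reaches the optimistic value, it can be checked numerically by scenario sampling. The sketch below is an illustrative assumption about how such a check could be coded, with a caller-supplied revenue function over sampled (ξ, output-deviation) scenario pairs; the toy revenue function exists only for the example.

```python
import numpy as np

def chance_constraint_ok(quote, f_bar, beta, revenue, scenarios):
    """Estimate Pr{revenue >= f_bar} over sampled scenarios and
    compare it with the risk-tolerance confidence level beta."""
    hits = sum(revenue(quote, xi, dev) >= f_bar for xi, dev in scenarios)
    return hits / len(scenarios) >= beta

# illustrative use with a toy revenue function and random scenarios
rng = np.random.default_rng(1)
scenarios = [(rng.normal(), rng.normal()) for _ in range(1000)]
toy_revenue = lambda quote, xi, dev: quote * (10.0 + xi) - abs(dev)
print(chance_constraint_ok(5.0, 40.0, 0.9, toy_revenue, scenarios))
```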

Lower-layer planning modeling:

The lower-layer planning, aiming at the comprehensive benefit of market operation, optimizes the dispatch of each power plant and the allocation of winning bids. The lower-layer programming model is in fact a market-equilibrium dispatch model of the regional retail market, and its accuracy determines whether the regional market can operate normally according to the rules. Since energy storage is ignored, the region's electricity purchase sources comprise distributed generators and the external grid, and the sum of purchase costs over all periods constitutes the system's cost. In addition, since users are willing to pay a certain cost to subscribe to new energy and enjoy green power, these user payments can also be included in the comprehensive benefit. The optimization target therefore minimizes the power purchase cost while crediting the green-power subscription fees.

Therefore, the mathematical expression of the lower-layer planning model is as follows:

[Objective function rendered as an image in the original publication.]

Constraint functions:

[Eleven constraint expressions rendered as images in the original publication.]

where:

Npv, Nwp — the total numbers of photovoltaic and wind power producers in the region;
L — the total number of electricity users in the region;
(symbol rendered as image) — the unit cost of purchasing electricity from the external grid at time t;
(symbol rendered as image) — the cost of purchasing electricity from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the electricity purchased from the external grid at time t;
(symbol rendered as image) — the electricity purchased from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the load of electricity user i at time t;
comppv, compwp — the per-kWh subscription compensation paid by users whose subscription covers renewable energy such as photovoltaic and wind power;
Qload-pv-i, Qload-wp-i — the photovoltaic and wind subscription electricity payable by user i in the day's settlement;
Qpv, Qwp, Qgrid — the photovoltaic, wind, and external electricity consumed in the region that day;
υpv, υwp — the proportions of photovoltaic and wind generation in the region that day;
αi, βi — the proportions of photovoltaic and wind power subscribed by user i;
(symbol rendered as image) — the maximum generation at time t declared by photovoltaic/wind producer i.

Function approximator:

The function approximator uses a set of adjustable parameters and features extracted from the state-action space to estimate the Q value. The approximator then establishes a mapping from the parameter space to the Q-value function over the state-action space. The mapping can be linear or nonlinear; a linear mapping can be used to analyze solvability. The typical form of a linear approximator is as follows:

Q(s, a; θ) = θTφ(s, a)

where θ is an adjustable approximation parameter vector and φ(s, a) is the feature vector of the state-action pair, obtained from:

φ(s, a) = [ψ1(s, a), ψ2(s, a), …, ψN(s, a)]T

where ψi(s, a) are basis functions (BFs), for example Gaussian radial BFs whose centers are selected fixed points in the state space. In general, the set of BFs corresponding to the fixed points is distributed uniformly over the state space. Unless otherwise specified, all vectors herein are column vectors, and (·)T denotes matrix transposition. Radial basis neural networks have been used in stochastic nonlinear interconnected systems and have demonstrated good generalization performance.
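A minimal sketch of such a linear approximator with Gaussian radial BFs, assuming a one-dimensional state space with centers laid out uniformly over [0, 1] (all sizes illustrative):

```python
import numpy as np

class LinearQ:
    """Q(s, a; theta) = theta^T phi(s, a) with Gaussian radial BFs
    centered on fixed points spread uniformly over the state space."""
    def __init__(self, n_centers=10, n_actions=2, width=0.2):
        self.centers = np.linspace(0.0, 1.0, n_centers)
        self.width = width
        self.theta = np.zeros(n_centers * n_actions)  # adjustable parameters

    def phi(self, s, a):
        """Feature vector of a state-action pair (one RBF block per action)."""
        rbf = np.exp(-((s - self.centers) ** 2) / (2 * self.width ** 2))
        feat = np.zeros_like(self.theta)
        feat[a * len(self.centers):(a + 1) * len(self.centers)] = rbf
        return feat

    def q(self, s, a):
        return self.theta @ self.phi(s, a)

approx = LinearQ()
print(approx.q(0.4, 1))   # 0.0 before any learning
```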

Diffusion strategy:

The reinforcement learning algorithm introduces a diffusion strategy into the learning process, bringing the adapt-then-combine (ATC) mechanism into the algorithm. The diffusion strategy can achieve faster convergence and a lower mean-square deviation than a consensus strategy. In addition, the diffusion strategy responds better to continuous real-time signals and is insensitive to the neighbor weights. Its basic idea is to incorporate cooperation terms based on neighboring states into each agent's self-state update. Consider an agent i with state xi and the dynamics:

xi(k+1) = xi(k) + f(xi(k))

The diffusion strategy is as follows:

x̃i(k+1) = xi(k) + f(xi(k))

xi(k+1) = Σj∈Ni bij x̃j(k+1)

where x̃i(k+1) (a symbol rendered as an image in the original) is an intermediate term introduced by the diffusion strategy, and xi(k+1) is the state updated by combining all the intermediate terms available to agent i. Ni is the set of nodes adjacent to agent i. Furthermore, bij is the weight assigned by agent i to the neighboring agent j. Here, a matrix B = [bij] ∈ Rn×n can be defined as the topology matrix of the microgrid communication network. In general, the topology matrix B is a stochastic matrix, which means B1n = 1n, where 1n ∈ Rn is the all-ones vector.
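The adapt-then-combine steps can be sketched directly; the topology matrix B and the local dynamics f below are illustrative, with B chosen stochastic so that B1n = 1n as required above.

```python
import numpy as np

B = np.array([[0.6, 0.4, 0.0],    # stochastic topology matrix:
              [0.2, 0.6, 0.2],    # b_ij > 0 only for neighbors j of i,
              [0.0, 0.4, 0.6]])   # and each row sums to one (B @ 1 = 1)
assert np.allclose(B.sum(axis=1), 1.0)

f = lambda x: -0.1 * x            # illustrative local update dynamics
x = np.array([1.0, 0.0, -1.0])    # agent states x_i(0)

for k in range(50):
    x_tilde = x + f(x)            # adapt: intermediate term per agent
    x = B @ x_tilde               # combine: weighted neighbor averaging
print(x)                          # states driven toward agreement near 0
```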

The collaborative reinforcement learning algorithm is obtained by integrating the adapt-then-combine (ATC) diffusion strategy into the parameter update process of Greedy-GQ:

θ̃i(k+1) = θi(k) + α(k)[δi(k)φi(k) − γ(ωi(k)Tφi(k))φ̂i(k+1)]

ω̃i(k+1) = ωi(k) + β(k)[δi(k) − φi(k)Tωi(k)]φi(k)

θi(k+1) = Σj∈Ni bij θ̃j(k+1)

ωi(k+1) = Σj∈Ni bij ω̃j(k+1)

Note that the proposed collaborative reinforcement learning algorithm introduces two intermediate vectors, θ̃i(k+1) and ω̃i(k+1); the actual approximation parameter vector θi(k+1) and the correction parameter vector ωi(k+1) are combinations of the intermediate vectors of neighboring agents. In the proposed algorithm, the learning-rate parameters α(k) and β(k) can be set with conditions P(1)–P(4):

α(k) > 0, β(k) > 0    P(1)

Σk α(k) = ∞, Σk α(k)² < ∞    P(2)

Σk β(k) = ∞, Σk β(k)² < ∞    P(3)

α(k)/β(k) → 0    P(4)
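Schedules of the form 1/(k+1)^p with 0.5 < p ≤ 1 are a standard way to meet such conditions; the sketch below assumes the summability conditions P(2)–P(3) as reconstructed above and uses illustrative exponents, with α decaying faster than β so that α(k)/β(k) → 0.

```python
# Illustrative two-timescale schedules: both positive (P1); the sums
# diverge while the squared sums converge because 0.5 < p <= 1 (P2, P3);
# and alpha decays faster than beta, so alpha(k)/beta(k) -> 0 (P4).
alpha = lambda k: 1.0 / (k + 1) ** 0.9
beta = lambda k: 1.0 / (k + 1) ** 0.6

print(alpha(10) / beta(10), alpha(10_000) / beta(10_000))  # ratio shrinks
```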

The above embodiments merely illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art may still modify or equivalently replace the specific embodiments of the invention, and any modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the protection scope of the pending claims of the invention.

Claims (3)

1. A distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning, characterized by comprising the following steps:

Step 1) Build a two-layer stochastic decision optimization model for distributed renewable energy trading; the two-layer stochastic decision optimization model for distributed renewable energy trading in step 1) comprises upper-layer planning modeling and lower-layer planning modeling, corresponding respectively to the two parts of the energy trading process;

Step 2) Introduce a multi-agent double-layer collaborative reinforcement learning algorithm; carry out learning and training according to the theoretical framework of the multi-agent double-layer collaborative reinforcement learning algorithm, and establish a function approximator; the function approximator uses a set of adjustable parameters and features extracted from the state-action space to estimate the Q value, and the approximator establishes a mapping from the parameter space to the Q-value function over the state-action space; the mapping is linear or nonlinear, and a linear mapping is used to analyze solvability; the typical form of the function approximator is as follows:
Q(s, a; θ) = θTφ(s, a)

φ(s, a) = [ψ1(s, a), ψ2(s, a), …, ψN(s, a)]T
where θ is an adjustable approximation parameter vector, φ(s, a) is the feature vector of the state-action pair, ψ are the basis functions (BFs), and (·)T denotes matrix transposition;
Step 3) On the basis of the framework in step 2), use an iterative calculation method to obtain an estimate of the optimal Q-value function;

Step 4) Use the trained multi-agent double-layer collaborative reinforcement learning algorithm to solve the optimization model and complete the optimization calculation; the upper-layer planning modeling constructs a chance-constrained program that maximizes the optimistic value of the objective function, the optimization objective being maximal economic benefit, with constraints composed of objective constraints and chance constraints; the mathematical expression of the upper-layer planning model is as follows:
[Objective function rendered as an image in the original publication.]

Constraint functions:

[Four constraint expressions rendered as images in the original publication.]
where:

λ — the generator's time-of-use quotation, λt being the quotation at time t;
ξ — a random variable caused by the unknown quotations of competing bidders;
(symbol rendered as image) — a random variable caused by the uncertain deviation between the actual and forecast wind and photovoltaic output;
(symbol rendered as image) — the generator's revenue under scenario ξ and the output-deviation scenario when the quotation is λ;
β — the confidence level of risk tolerance;
(symbol rendered as image) — the expected revenue attained at confidence β;
qt,ξ — under scenario ξ, the electricity this generator wins in period t, obtained from the lower-layer planning;
csξ — under scenario ξ, the per-unit-electricity new-energy subscription compensation obtained from the lower-layer decision;
cbase — unit generation cost;
(symbol rendered as image) — the generator's default penalty under scenario ξ and the output-deviation scenario;
γ — the unit penalty for uncompleted electricity;
(symbol rendered as image) — at time t, the unbalanced electricity by which the winning bid in scenario ξ exceeds the maximum output of the output-deviation scenario;
(symbol rendered as image) — the actual upper output limit of the distributed generation at time t in the output-deviation scenario;
T — one period, with a default value of one hour;
the lower-layer planning modeling is used, for the bidding scenario, to optimize dispatch and allocate each generator's winning electricity with the comprehensive benefit of market operation as the objective; the mathematical expression of the lower-layer planning is as follows:
[Objective function rendered as an image in the original publication.]

Constraint functions:

[Eleven constraint expressions rendered as images in the original publication.]
where:

Npv, Nwp — the total numbers of photovoltaic and wind power producers in the region;
L — the total number of electricity users in the region;
(symbol rendered as image) — the unit cost of purchasing electricity from the external grid at time t;
(symbol rendered as image) — the cost of purchasing electricity from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the electricity purchased from the external grid at time t;
(symbol rendered as image) — the electricity purchased from photovoltaic/wind producer i at time t;
(symbol rendered as image) — the load of electricity user i at time t;
comppv, compwp — the per-kWh subscription compensation paid by users whose subscription covers photovoltaic and wind renewable energy;
Qload-pv-i, Qload-wp-i — the photovoltaic and wind subscription electricity payable by user i in the day's settlement;
Qpv, Qwp, Qgrid — the photovoltaic, wind, and external electricity consumed in the region that day;
υpv, υwp — the proportions of photovoltaic and wind generation in the region that day;
αi, βi — the proportions of photovoltaic and wind power subscribed by user i;
(symbol rendered as image) — the maximum generation at time t declared by photovoltaic/wind producer i.
2. The distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning according to claim 1, characterized in that: in step 2), multiple agents are used to handle, respectively, the randomness inherent in the upper-layer and lower-layer planning models and the mutual iteration between the two layers; the two-layer collaborative reinforcement learning algorithm introduces a diffusion strategy into the reinforcement learning process, bringing the adapt-then-combine (ATC) mechanism into the reinforcement learning algorithm; the two-layer collaborative reinforcement learning algorithm adapts to the randomness and uncertainty brought by distributed renewable energy and to the computational complexity of the two-layer stochastic decision optimization model; to avoid storing a large number of Q-value tables, a function approximator is used to record the Q values of the complex continuous state and action spaces.

3. The distributed renewable energy transaction decision-making method based on multi-agent double-layer collaborative reinforcement learning according to claim 2, characterized in that the diffusion strategy achieves faster convergence and a lower mean-square deviation than a consensus strategy, the diffusion strategy being as follows:
x̃i(k+1) = xi(k) + f(xi(k))

xi(k+1) = Σj∈Ni bij x̃j(k+1)
where x̃i(k+1) (a symbol rendered as an image in the original) is an intermediate term introduced by the diffusion strategy, and xi(k+1) is the state updated by combining all the intermediate terms available to agent i; Ni is the set of nodes adjacent to agent i; bij is the weight assigned by agent i to the neighboring agent j; here, a matrix B = [bij] ∈ Rn×n is defined as the topology matrix of the microgrid communication network; the topology matrix B is a stochastic matrix, B1n = 1n, where 1n ∈ Rn is the all-ones vector.
CN201910519858.1A 2019-06-17 2019-06-17 Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning Active CN110276698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910519858.1A CN110276698B (en) 2019-06-17 2019-06-17 Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910519858.1A CN110276698B (en) 2019-06-17 2019-06-17 Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning

Publications (2)

Publication Number Publication Date
CN110276698A CN110276698A (en) 2019-09-24
CN110276698B true CN110276698B (en) 2022-08-02

Family

ID=67960916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910519858.1A Active CN110276698B (en) 2019-06-17 2019-06-17 Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning

Country Status (1)

Country Link
CN (1) CN110276698B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990793B (en) * 2019-12-07 2024-03-15 国家电网有限公司 Scheduling optimization method for electric-heat-gas coupled micro-energy station
CN111064229B (en) * 2019-12-18 2023-04-07 广东工业大学 Wind-light-gas-storage combined dynamic economic dispatching optimization method based on Q learning
CN111200285B (en) * 2020-02-12 2023-12-19 燕山大学 Micro-grid hybrid coordination control method based on reinforcement learning and multi-agent theory
CN112612206B (en) * 2020-11-27 2022-11-08 合肥工业大学 Multi-agent collaborative decision-making method and system for uncertain events
CN112714165B (en) * 2020-12-22 2023-04-04 声耕智能科技(西安)研究院有限公司 Distributed network cooperation strategy optimization method and device based on combination mechanism
CN112859591B (en) * 2020-12-23 2022-10-21 华电电力科学研究院有限公司 Reinforced learning control system for operation optimization of energy system
CN113378456B (en) * 2021-05-21 2023-04-07 青海大学 Multi-Park Integrated Energy Scheduling Method and System
CN113421004B (en) * 2021-06-30 2023-05-26 国网山东省电力公司潍坊供电公司 Transmission and distribution cooperative active power distribution network distributed robust extension planning system and method
CN113555870B (en) * 2021-07-26 2023-10-13 国网江苏省电力有限公司南通供电分公司 Q-learning photovoltaic prediction-based power distribution network multi-time scale optimal scheduling method
CN113780622B (en) * 2021-08-04 2024-03-12 华南理工大学 Multi-agent reinforcement learning-based distributed scheduling method for multi-microgrid power distribution system
CN113743583B (en) * 2021-08-07 2024-02-02 中国航空工业集团公司沈阳飞机设计研究所 Method for inhibiting switching of invalid behaviors of intelligent agent based on reinforcement learning
CN114021815B (en) * 2021-11-04 2023-06-27 东南大学 Scalable energy management collaboration method for community containing large-scale producers and consumers
CN114611813B (en) * 2022-03-21 2022-09-27 特斯联科技集团有限公司 Community hot-cold water circulation optimal scheduling method and system based on hydrogen energy storage
CN115115211A (en) * 2022-06-24 2022-09-27 中国电力科学研究院有限公司 Multi-microgrid system layered reinforcement learning optimization method and system and storage medium
WO2024084125A1 (en) * 2022-10-19 2024-04-25 Aalto University Foundation Sr Trained optimization agent for renewable energy time shifting
CN117559387B (en) * 2023-10-18 2024-06-21 东南大学 VPP internal energy optimization method and system based on deep reinforcement learning dynamic pricing
CN117350515B (en) * 2023-11-21 2024-04-05 安徽大学 A method for energy flow scheduling of offshore island groups based on multi-agent reinforcement learning
CN117833359A (en) * 2023-12-28 2024-04-05 国网山东省电力公司东营供电公司 Distributed photovoltaic management method and device based on optimal consumption interval

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6820815B2 (en) * 2017-09-07 2021-01-27 株式会社日立製作所 Learning control system and learning control method
CN109325608B (en) * 2018-06-01 2022-04-01 国网上海市电力公司 Distributed power supply optimal configuration method considering energy storage and considering photovoltaic randomness

Also Published As

Publication number Publication date
CN110276698A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
CN110276698B (en) Distributed renewable energy transaction decision method based on multi-agent double-layer collaborative reinforcement learning
Wang et al. Stochastic cooperative bidding strategy for multiple microgrids with peer-to-peer energy trading
Maity et al. Simulation and pricing mechanism analysis of a solar-powered electrical microgrid
Liu et al. Comparison of centralized and peer-to-peer decentralized market designs for community markets
CN110518580B (en) An active distribution network operation optimization method considering active optimization of microgrid
Li et al. Two-stage community energy trading under end-edge-cloud orchestration
CN112149914A (en) Method for optimizing and configuring power market resources under multi-constraint condition
CN113378456A (en) Multi-park comprehensive energy scheduling method and system
CN114169916B (en) Market member quotation strategy formulation method suitable for novel power system
Chuang et al. Deep reinforcement learning based pricing strategy of aggregators considering renewable energy
CN111311012A (en) Multi-agent-based micro-grid power market double-layer bidding optimization method
Yin et al. Equilibrium stability of asymmetric evolutionary games of multi-agent systems with multiple groups in open electricity market
CN114399369B (en) Multi-benefit-body random dynamic transaction game method at power distribution network side
CN111259315B (en) Decentralized scheduling method of multi-subject coordinated pricing mode
CN111553750A (en) Energy storage bidding strategy method considering power price uncertainty and loss cost
CN112636327A (en) Block chain transaction and energy storage system based solution
CN118157143A (en) GA-MADRL-PPO combination-based distributed photovoltaic optimal scheduling strategy method, device and system
Li et al. A data-driven joint chance-constrained game for renewable energy aggregators in the local market
Xu et al. Deep reinforcement learning and blockchain for peer-to-peer energy trading among microgrids
CN114204549A (en) Wind-solar-storage cluster joint optimization operation method considering energy storage sharing
Fang et al. Multiagent reinforcement learning with learning automata for microgrid energy management and decision optimization
CN117114877A (en) Medium-and-long-term power transaction method and system based on virtual power plant
Gao et al. Decision-making method of sharing mode for multi-microgrid system considering risk and coordination cost
Fan et al. Medium and long-term electricity trading considering renewable energy participation
CN118917753A (en) Cluster division-based power distribution network double-layer planning method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant