CN116436033A - Temperature control load frequency response control method based on user satisfaction and reinforcement learning

Temperature control load frequency response control method based on user satisfaction and reinforcement learning

Info

Publication number
CN116436033A
Authority
CN
China
Prior art keywords
temperature
user satisfaction
temperature control
control
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310367857.6A
Other languages
Chinese (zh)
Inventor
陈汝斯
刘海光
蔡德福
李大虎
杨旋
周悦
周鲲鹏
孙冠群
王尔玺
王文娜
许典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hubei Electric Power Co Ltd
Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hubei Electric Power Co Ltd
Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hubei Electric Power Co Ltd, Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202310367857.6A priority Critical patent/CN116436033A/en
Publication of CN116436033A publication Critical patent/CN116436033A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/24Arrangements for preventing or reducing oscillations of power in networks
    • H02J3/241The oscillation concerning frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13Differential equations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/02Computing arrangements based on specific mathematical models using fuzzy logic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20 Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S20/00Management or operation of end-user stationary applications or the last stages of power distribution; Controlling, monitoring or operating thereof
    • Y04S20/20End-user application control systems
    • Y04S20/222Demand response systems, e.g. load shedding, peak shaving

Abstract

The invention relates to a temperature control load frequency response control method based on user satisfaction and reinforcement learning, which comprises a method for quantifying the satisfaction of temperature control load users and the construction of a deep reinforcement learning agent model. Considering the two control modes of temperature control loads, direct switch control and temperature-setting control, an energy storage index and a discomfort index are respectively defined as load adjustment indexes, and user satisfaction is evaluated with a fuzzy comprehensive evaluation method. A multi-agent model of the temperature control loads is then established based on the soft actor-critic algorithm; the user satisfaction and the frequency adjustment error are weighted to form a comprehensive evaluation index that enters the objective function optimized by the agents, and each agent updates its parameters according to local temperature control load information and the frequency deviation, so that adaptive learning of the model solves the cooperative control problem of temperature control loads participating in frequency response. Compared with the prior art, the invention reduces the system frequency deviation and improves user satisfaction.

Description

Temperature control load frequency response control method based on user satisfaction and reinforcement learning
Technical Field
The invention relates to the technical field of temperature control load frequency response control methods, in particular to a temperature control load frequency response control method based on user satisfaction and deep reinforcement learning.
Background
With the continuously increasing share of renewable energy in the power grid, its intermittent and fluctuating characteristics pose great challenges to the active power balance and frequency stability of the grid. The traditional power system maintains balance by adjusting the output of generation-side units; this single mode of regulation incurs additional economic and environmental costs. In addition, with the growth of electric load and the wide access of renewable energy, the regulation capability of the generation side gradually decreases. A new type of power system dominated by renewable energy can use advanced information technology to integrate and dispatch demand-side resources to provide various ancillary services. Therefore, reasonable control of demand-side resources can supplement traditional system frequency regulation and enhance the stability of the power system.
Among demand-side resources, the temperature control load (thermostatically controlled load, TCL) is a class of electrical equipment whose switching is governed by a thermostat, which converts electricity into heating or cooling and whose temperature is adjustable; it includes heat pumps, water heaters, refrigerators, and heating, ventilation and air-conditioning systems. Temperature control loads can provide frequency regulation services mainly for the following three reasons: first, they are widely distributed in residential, commercial and industrial buildings and have large adjustable potential; second, they have good heat storage capability and can be regarded as distributed energy storage devices; third, their control modes are flexible and they can respond to the power demand of the system in time. Therefore, to fully exploit the frequency regulation potential of flexible demand-side resources and keep the grid frequency within a certain deviation range, control strategies for large-scale demand-side temperature control loads require intensive research.
The prior art mainly adopts centralized control, decentralized control and hybrid control. Some researchers have established a layered centralized load-tracking control framework that coordinates heterogeneous demand-side temperature control load aggregators and is modeled with a state-space model. Decentralized control moves the decision mechanism of load control down to the local control end: programs or thresholds are set locally in advance, and when the load-side device detects a change in a key parameter, the load acts according to the preset strategy. Because the decision is made at the local port, the communication demand is low and the response is fast, but the control effect is strongly affected by user behavior and by errors of the detection devices. Other studies optimize the settings of each load with a multi-objective optimization method to reduce the required load response and trigger decentralized load control based on a frequency response index. Hybrid control combines the characteristics of centralized and decentralized control, establishing a "centralized parameter setting - decentralized decision" framework that coordinates large-scale users and the grid control center through a load aggregator (LA); researchers have established a two-stage control model based on hybrid control to participate in energy market trading, and have used hybrid-controlled temperature control loads to mitigate photovoltaic and load variations in microgrid communities, but a communication network must be built between the control center and all aggregators. In research on temperature control loads providing ancillary services, the literature has built dynamic models and verified with direct load control that variable-frequency heat pumps perform well in providing frequency regulation, but it mainly studies the dynamic response of a single air conditioner and rarely discusses coordinated control of large-scale air-conditioning loads. Others have established a virtual energy storage model of variable-frequency air conditioners, shielding part of the model information through a layered control framework and simplifying downlink control with a unified broadcast signal, but this simplification sacrifices part of the adjustable capacity of the air-conditioner clusters.
There are two main control modes for temperature control loads: direct switching and temperature setting. Some researchers have achieved frequency regulation based on direct load switching; its advantage is that, within the adjustable capacity of the loads, the tracking accuracy of the system is high and the impact on user comfort is low. Its disadvantage is that when the indoor temperatures of the loads are concentrated near the temperature boundary, the devices switch on and off frequently, which not only prevents the regulation task from being completed but also shortens the service life of the equipment. Temperature-setting control avoids these drawbacks, but its limitation is that the power-tracking performance depends on the designed controller (common controllers include the minimum-variance controller, the sliding-mode controller and the internal model controller); a further limitation is the large temperature variation range, which affects user comfort. Researchers have built residential building energy management systems (energy management system, EMS) based on optimization techniques combined with machine learning models, using real residential data to train and test demand response controllers that maintain thermal comfort while reducing energy consumption. Therefore, considering the influence of user satisfaction during load response control is of great significance for the willingness of users to participate in frequency regulation. The literature has also proposed a hybrid control strategy based on a parallel structure that improves the tracking accuracy of the system and reduces the switching frequency of the devices, but the temperature variation range remains very large, which reduces user comfort.
Advanced reinforcement learning algorithms of recent years offer a new solution to the frequency control problem of power systems; with their strong search and learning capability, they have the potential for online optimal decision-making when facing complex nonlinear frequency control problems. Researchers have realized cooperative control of distributed generation units with the Q-learning algorithm of deep reinforcement learning, thereby eliminating the frequency deviation of the system. However, the Q-learning algorithm can only select control actions from a discretized, low-dimensional action domain and therefore cannot handle problems containing continuous variables. Others have proposed deep reinforcement learning algorithms acting on continuous action domains, realizing adaptive load frequency control, but these only optimally control a single generator set or a single residential building and are not suitable for controlling large-scale temperature control loads.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a temperature control load frequency response control method based on user satisfaction and reinforcement learning; its deep reinforcement learning control strategy, based on the soft actor-critic framework, can reduce system frequency fluctuation while improving user satisfaction.
The aim of the invention can be achieved by the following technical scheme:
a temperature control load frequency response control method based on user satisfaction and reinforcement learning comprises the following steps:
1) Establishing a temperature control load model and a power system frequency response model with the temperature control load participating in frequency modulation by adopting a first-order ordinary differential equation;
2) Aiming at temperature control loads adopting two control modes of direct switch control and temperature setting control, respectively establishing user satisfaction degree adjustment indexes of the temperature control loads under the two control modes;
3) According to the user satisfaction degree adjustment index established in the step 2), comprehensive evaluation of the user satisfaction degree is carried out by utilizing a fuzzy comprehensive evaluation method to obtain the user satisfaction degree;
4) Defining a frequency adjustment error index according to the frequency error signal and the tracking power signal of the power system in the control period, and carrying out weighted combination on the user satisfaction degree obtained in the step 3) and the frequency adjustment error index to obtain a comprehensive evaluation index;
5) Establishing a deep reinforcement learning agent model based on the soft actor-critic algorithm, constructing the agent action space and agent state space according to the frequency change of the power system and the operating-state environment information of the demand-side temperature control loads, and constructing the reward function of the agent model according to the comprehensive evaluation index obtained in step 4);
6) Training the agent model with the soft actor-critic algorithm to solve for its optimal policy, wherein the training process comprises construction of the agent objective function, agent policy iteration and policy updating, and agent parameter updating; the objective function of the agent is constructed from the action space, state space and reward function of step 5) combined with the policy entropy, and the agent maximizes the objective function by continuously optimizing its policy; in this process the agent performs policy iteration with the Bellman operator and then updates the policy by minimizing the divergence between the new and old policies; the neural networks of the Q-value network and the policy network in the agent model are constructed, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model converges and the optimal policy of the agent model is obtained;
7) Applying the trained agent model of step 6) online to an actual temperature control load cluster: the real-time operating state information, user satisfaction information and grid frequency information of the cluster are input into the agent control model, the trained agent rapidly calculates the control instruction of the cluster at the current moment, and the cluster adjusts its load according to the control instruction.
Further, the step 1) adopts a first-order ordinary differential equation to establish a temperature control load model, and the specific steps include:
11 Establishing a first-order ordinary differential equation model introducing a state variable and a virtual variable to represent the dynamic characteristics of any temperature control load;
12 Calculating the sum of rated powers of the temperature control load clusters according to the dynamic characteristic equation of the single temperature control load.
Further, step 2) establishes user satisfaction adjustment indexes of temperature control loads in two control modes, and the specific steps include:
21) For a directly switch-controlled temperature control load cluster, the influence of the temperature set point on user comfort is neglected since the control mode acts directly on the device switches; an energy storage index C_s is defined for the cluster, and the agent issues control instructions that keep C_s as close to 0 as possible, reducing the start-stop frequency of the devices;
22) For a temperature-setting-controlled temperature control load cluster, a discomfort index C_u is defined, and the agent issues control instructions that keep C_u as close to 0 as possible, reducing user discomfort.
Further, in the step 3), according to the user satisfaction adjustment index established in the step 2), comprehensive evaluation of the user satisfaction is performed by using a fuzzy comprehensive evaluation method to obtain the user satisfaction, and the specific steps include:
31) Constructing the user satisfaction factor set containing the energy storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u};
32) Constructing the user satisfaction comment set with five comment grades according to the degree of user satisfaction, i.e. V = {satisfied, fairly satisfied, neutral, less satisfied, dissatisfied};
33) Determining the weight of each influencing factor, wherein the factor set consists of the energy storage index C_s and the discomfort index C_u, whose importance to the user is set to be the same, the weight vector being [0.5, 0.5];
34) Establishing the fuzzy judgment matrix that judges the degree to which each factor belongs to each comment, with the membership function chosen as a Gaussian function;
35) Performing the fuzzy comprehensive evaluation to assess user satisfaction, the user satisfaction index m being defined such that the smaller m is, the higher the user satisfaction.
Further, in step 4), the user satisfaction obtained in step 3) and the frequency adjustment error index are weighted and combined into the comprehensive evaluation index, the specific steps comprising:
41) Evaluating the tracking performance of the system and defining the frequency adjustment error index E_RMS, where the smaller E_RMS is, the higher the tracking accuracy of the system;
42) Weighting and combining the frequency adjustment error index E_RMS with the user satisfaction m to define the comprehensive evaluation index J.
Further, step 5) establishes the deep reinforcement learning agent model based on the soft actor-critic algorithm, the specific steps comprising:
51) Establishing the input information of the agent model, i.e. the agent state space, composed of the switching states, rated powers, indoor and outdoor temperatures and temperature set points of the temperature control load cluster controlled by the agent, the frequency deviation of the power system, and the user satisfaction m calculated in step 3); the state space is input into the agent model to realize the agent's perception of the environment;
52) Establishing the output control instruction of the agent model, i.e. the agent action space: according to the two control modes of direct switch control and temperature-setting control, the control instructions of the temperature control loads are set as the load switching commands and the temperature set points, and the constraint conditions of the control instructions are set as the limit on frequent switching and the allowed set temperature range of the temperature control loads;
53) Establishing the optimization target of the agent model according to the comprehensive evaluation index established in step 4), i.e. the reward function required by the agent model, which is set to the negative of the comprehensive evaluation index J formed by the weighted combination of the user satisfaction and the frequency adjustment error index.
Further, the objective function of the soft actor-critic algorithm in step 6) maximizes the policy entropy while maximizing the cumulative reward, and the specific steps of constructing the objective function of the agent are:
61) Constructing the objective function containing the entropy regularization term, i.e.

J(π) = Σ_q E_{(s_q,a_q)~p_π}[ r(s_q, a_q) + α·H(π(·|s_q)) ]

Wherein: E(·) is the expectation function; π is the policy; s_q is the state space of the q-th agent; a_q is the action space of the q-th temperature control load; r(s_q, a_q) is the reward function of the q-th agent; (s_q, a_q)~p_π is the state-action trajectory generated by policy π; α is the temperature term, which determines the influence of the entropy on the reward; H(π(·|s_q)) is the entropy term of the policy in state s_q;
62) Setting the entropy term of the policy, calculated as:

H(π(·|s_q)) = −E_{a_q~π}[ log π(a_q|s_q) ]
Further, in step 6), the agent performs policy iteration with the Bellman operator, constructed specifically as follows:
71) The cost function is composed of the reward function and the expected value over the next state s_{q+1}; the Bellman operator used for policy updating contains the reward function and the expectation of the new value function, calculated as:

Q(s_q, a_q) = r(s_q, a_q) + γ·E_{s_{q+1}}[ V(s_{q+1}) ]

T^π Q(s_q, a_q) = r(s_q, a_q) + γ·E_{s_{q+1}}[ V(s_{q+1}) ]

Wherein: E_{s_{q+1}}(·) is the expectation over the next state s_{q+1}; T^π is the Bellman backup operator under policy π; γ is the discount factor of the reward; V(s_{q+1}) is the new value function of state s_{q+1}:

V(s_{q+1}) = E_{a_{q+1}~π}[ Q(s_{q+1}, a_{q+1}) − α·log π(a_{q+1}|s_{q+1}) ]
further, in step 6), the Q-value network outputs a single value through the neural network, and the Q-value network parameters have the following update policies:
Figure SMS_7
wherein: θ is a Q-value network parameter; phi is a policy network parameter; v (V) θ And Q θ Respectively substituting the new value function and the cost function after the Q value network parameters are substituted;
the policy network output is a gaussian distribution, and the policy network updates the policies as follows:
Figure SMS_8
wherein: z(s) q ) Is state s q A time distribution function;
updating the temperature parameters to realize the iterative test of all feasible actions, wherein the updating strategy is as follows:
Figure SMS_9
wherein: pi q A control strategy for the q-th agent; h 0 Is an entropy term;
the deep neural network learns to continuously update the Q value network parameter, the strategy network parameter and the temperature parameter, so that the model is continuously converged, and the optimal strategy of the intelligent agent model is solved.
Compared with the prior art, the temperature control load frequency response control method based on user satisfaction and reinforcement learning has the following advantages:
1. The invention considers the influence of user satisfaction on temperature control load frequency response. For switch-controlled and temperature-setting-controlled temperature control loads, the energy storage index and the discomfort index are respectively established as load adjustment indexes representing the satisfaction of temperature control load users, and user satisfaction is evaluated comprehensively by the fuzzy comprehensive evaluation method to obtain a user satisfaction evaluation index, which serves as one of the optimization targets of temperature control loads participating in frequency response. Meanwhile, considering the frequency regulation effect of the power system, the user satisfaction evaluation index and the frequency adjustment error index are weighted and combined into an objective function, which is set as the reward function of the agent model. This markedly improves user satisfaction;
2. The invention establishes a deep reinforcement learning agent model based on the soft actor-critic (SAC) algorithm. The agent and the environment interact continuously according to a Markov decision process (MDP): the agent obtains the environment state, takes actions that change that state, and receives corresponding rewards or penalties as guidance for updating the model parameters; through continuous learning it maximizes the cumulative reward and makes accurate and effective control decisions. This markedly reduces frequency fluctuation.
Drawings
FIG. 1 is a frequency modulation model of an electric power system with the participation of a temperature control load in an embodiment of the invention;
FIG. 2 shows the temperature-controlled load operating characteristics under two control modes, namely a switch control mode and a temperature setting mode, according to an embodiment of the present invention;
FIG. 3 is the decision-making process of the soft actor-critic deep reinforcement learning model in an embodiment of the invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The invention relates to a temperature control load frequency response control method based on user satisfaction and reinforcement learning, and provides a cooperative temperature control load frequency control method that accounts for user satisfaction, based on soft actor-critic deep reinforcement learning, to solve the frequency control problem of a large-scale power system in which demand-side temperature control loads participate in frequency regulation, thereby optimizing system frequency control and improving user satisfaction.
The main principle of the temperature control load frequency response control method based on user satisfaction and deep reinforcement learning established by the invention is as follows:
Regarding user satisfaction evaluation, the influence of the two control modes, switch control and temperature setting, on the satisfaction of temperature control load users is considered. FIG. 2 shows the operating characteristics of temperature control loads in the two control modes: for a directly switch-controlled load the temperature set point remains unchanged and the adjustment command sets the switching state, while the dispatch command of a temperature-setting-controlled load adjusts the temperature set point upward or downward. To quantify user satisfaction under the different control modes, an energy storage index and a discomfort index are established for the two kinds of temperature control loads respectively, and user satisfaction is evaluated by the fuzzy comprehensive evaluation method. Then, to realize cooperative frequency control of large-scale temperature control loads, a multi-agent control model is established based on the soft actor-critic algorithm, with user satisfaction and the frequency adjustment deviation as optimization targets and the switching states and temperature set points of the large-scale temperature control loads as optimization variables; the agents are trained by interacting with the environment, and the trained multi-agent reinforcement learning model, which accounts for user satisfaction, realizes online cooperative frequency response control of the temperature control load clusters.
Regarding the control algorithm for large-scale temperature control load frequency response, the deep reinforcement learning method based on the SAC algorithm lets the agent interact continuously with the environment: it obtains the environment state, takes actions that change that state, and receives corresponding rewards or penalties as guidance for updating the model parameters, maximizing the cumulative reward through continuous learning. In the iterative calculation at each time step, the actor first generates the action a_t (i.e. the control variables) through the policy network, according to the observed frequency deviation of the power system and the operating state s_t of the temperature control load cluster at that moment. The temperature control load cluster then performs a state transition according to the current control strategy, reaching the state s_{t+1} of the next moment. At the same time, the system environment computes the reward r(s_t, a_t) (the objective function) and feeds it back to the agent, which records (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool. The action policy samples of the actor are then input, together with the system state, into the critic, which outputs the action-value function Q(s_t, a_t) to evaluate the policy. This process repeats cyclically, and the actor and critic update their neural network parameters by gradient descent, realizing adaptive learning of the model. During training, the cumulative return of the agents over the response period gradually increases and finally stabilizes. By introducing a maximum-entropy-encouraging strategy, the SAC reinforcement learning algorithm improves robustness and accelerates training, and can make accurate and effective control decisions for large-scale temperature control loads in a complex power supply-demand environment.
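This interaction loop can be summarized in the following Python sketch; the environment, agent and buffer classes (TCLClusterEnv-style objects) are illustrative assumptions, not components named by the patent.

```python
# Sketch of the agent-environment interaction loop described above.
# `env`, `agent` and `buffer` are assumed duck-typed objects; the patent
# does not name or specify these components.

def train(env, agent, buffer, episodes=200, steps_per_episode=96):
    """Offline training loop: observe state, act, store transition, update."""
    for _ in range(episodes):
        s_t = env.reset()                      # frequency deviation + TCL cluster states
        for _ in range(steps_per_episode):
            a_t = agent.select_action(s_t)     # actor samples from the policy network
            s_next, r_t = env.step(a_t)        # switches / set points applied;
                                               # reward = -(weighted J index)
            buffer.add(s_t, a_t, r_t, s_next)  # record (s_t, a_t, r(s_t, a_t), s_{t+1})
            agent.update(buffer.sample(256))   # critic evaluates, both nets take
                                               # gradient-descent steps
            s_t = s_next
```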
Based on the principle, the temperature control load frequency response control method based on user satisfaction and deep reinforcement learning specifically comprises the following steps:
First step: the first-order ordinary differential equation model that accounts for the indoor environment, the outdoor environment and the building characteristics offers high accuracy and simple computation and is widely applied in practice; it is adopted here to build the temperature control load dynamic model and the frequency response model of the power system with temperature control loads participating in frequency regulation (shown in FIG. 1). The specific operations are as follows:
11) Introduce the state variable T_i and the virtual variable s_i into the model. The operating characteristic of the i-th temperature control load in cooling mode can be expressed as:

dT_i(k)/dk = [ T_o(k) − T_i(k) − s_i(k)·R_i·P_i ] / (C_i·R_i)    (1)

wherein the switching rule of s_i(k) is:

s_i(k+Δk) = { 1, T_i(k) ≥ T_i^max;  0, T_i(k) ≤ T_i^min;  s_i(k), otherwise }    (2)

T_i^max = T_i^set + δ/2,  T_i^min = T_i^set − δ/2    (3)

Wherein: T_o(k) and T_i(k) are the outdoor and indoor temperatures respectively; C_i, R_i and P_i are respectively the equivalent heat capacity, equivalent thermal resistance and energy transfer rate of the i-th temperature control load; s_i(k) indicates the load switching state, with on state s_i(k) = 1 and off state s_i(k) = 0; T_i^max and T_i^min are respectively the upper and lower temperature limits during load operation; T_i^set is the temperature set point; δ is the temperature dead band, a constant; k and Δk are respectively the running time and the control period. Solving the differential equation gives:

T_i(k) = T_o(k) − s_i(k)·R_i·P_i + [ T_i(0) − T_o(k) + s_i(k)·R_i·P_i ]·e^(−k/(C_i·R_i))    (4)

Wherein: T_i(0) is the initial indoor temperature.
For a load cluster consisting of N temperature control loads, the aggregate power consumption P_total(k) is the sum of the rated powers of all loads currently switched on, i.e.

P_total(k) = Σ_{i=1}^{N} s_i(k)·P_i^n    (5)

P_i^n = P_i / η_i    (6)

Wherein: P_i^n is the rated power of the i-th temperature control load; η_i is the energy conversion efficiency coefficient of the i-th temperature control load.
FIG. 1 shows the power system frequency response model with temperature control loads participating in frequency regulation, wherein T_Ga and T_Gb are respectively the time constants of the governor and the turbine; a transient characteristic compensation link, a lead-lag transfer function with time constants T_1 and T_2, is placed between the governor and the turbine; T_R is the temperature control load response delay time constant; T_c is the communication delay time constant; R_eq is the equivalent unit droop rate; ΔP_G and ΔP_L represent respectively the total output power of the generators and the disturbance power of the system; H and D represent respectively the system inertia time constant and the load damping coefficient; Δf is the frequency deviation.
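For illustration only, the following sketch simulates a small cluster of switch-controlled temperature control loads in cooling mode according to equations (1) to (6) as reconstructed above; all parameter values and the time step are arbitrary example assumptions, not values from the patent.

```python
import numpy as np

def simulate_tcl_cluster(T0, T_out, T_set, delta, C, R, P, eta,
                         dt=1.0 / 60, steps=1440):
    """Simulate N switch-controlled TCLs in cooling mode, equations (1)-(6).
    Per-load quantities are length-N arrays; dt is the control period in hours."""
    T = T0.copy()
    s = np.zeros_like(T)                        # switch state: 0 = off, 1 = on
    P_rated = P / eta                           # equation (6): electrical rated power
    P_total = np.zeros(steps)
    for k in range(steps):
        # Thermostat switching rule, equations (2)-(3)
        s = np.where(T >= T_set + delta / 2, 1.0, s)   # too warm: cooling on
        s = np.where(T <= T_set - delta / 2, 0.0, s)   # cool enough: off
        # Exact solution of the first-order ODE over one period, equation (4)
        T_ss = T_out - s * R * P                       # steady-state temperature
        T = T_ss + (T - T_ss) * np.exp(-dt / (C * R))
        P_total[k] = np.sum(s * P_rated)               # aggregate power, equation (5)
    return P_total

rng = np.random.default_rng(0)
N = 100
P_agg = simulate_tcl_cluster(
    T0=rng.uniform(22.0, 24.0, N), T_out=32.0, T_set=np.full(N, 23.0),
    delta=1.0, C=rng.uniform(1.5, 2.5, N), R=rng.uniform(1.8, 2.2, N),
    P=np.full(N, 6.0), eta=np.full(N, 2.5))
```

With these example parameters each load cycles between its temperature bounds, and P_agg traces the aggregate power of equation (5) over a simulated day.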
Second step: establish the user satisfaction quantification indexes of the two kinds of loads for temperature control loads under the different control modes. The specific operations are as follows:
21) For a temperature control load cluster under direct switch control in cooling mode, define the energy storage index C_s:

[equation (7): definition of the energy storage index C_s, not reproduced in this text]

From the definition of C_s, the closer C_s is to 0, the closer the indoor temperatures are to the temperature set points, the more uniform the temperature distribution of the temperature control loads, the larger the adjustable potential, and the less frequently the switches act; therefore the agent should, when issuing control instructions, keep C_s as close to 0 as possible.
22) For a temperature control load cluster under temperature-setting control, define the discomfort index C_u as:

C_u = |T_i^set − T_i^set(0)|    (8)

Wherein: T_i^set(0) is the initial temperature set point. From the definition of C_u, the more the temperature set point deviates from its initial value, the higher the user's discomfort. Therefore the agent should, when outputting control instructions, keep C_u as close to 0 as possible, thereby reducing user discomfort.
Third step: comprehensively evaluate the user satisfaction with the fuzzy comprehensive evaluation method. The specific steps are as follows:
31) Construct the user satisfaction factor set containing the energy storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u}.
32) Construct the user satisfaction comment set V = {satisfied, fairly satisfied, neutral, less satisfied, dissatisfied}.
33) Determine the weight of each factor. Since the factor set consists of the two factors C_s and C_u, which are of equal importance to the user, the weight vector is A = [a_1, a_2] = [0.5, 0.5].
34) Establish the fuzzy judgment matrix. First, the degree to which each factor belongs to each comment is evaluated. Since most quantities follow a normal distribution, the membership function is chosen as the Gaussian function, i.e.

r_sp = exp( −(y_s − u_sp)² / (2·σ_sp²) )    (9)

Wherein: y_s is the input of the s-th factor, i.e. C_s or C_u; u_sp and σ_sp are respectively the mean and standard deviation of the s-th factor under the p-th comment. The fuzzy evaluation matrix R is then:

R = [ r_11  r_12  r_13  r_14  r_15
      r_21  r_22  r_23  r_24  r_25 ]    (10)
35) Perform the fuzzy comprehensive evaluation. The fuzzy evaluation set is:

B = A ∘ R = (b_1, b_2, b_3, b_4, b_5)    (11)

wherein ∘ denotes the fuzzy matrix composition operation. Because the weighted-average fuzzy composition operator reflects the weights distinctly and has a strong degree of synthesis, making full use of the information in R, the element b_p is:

b_p = Σ_{s=1}^{2} a_s·r_sp,  p = 1, 2, …, 5    (12)

36) Evaluate the user satisfaction. To obtain a continuous, quantitative grading, the grade ranks corresponding to the elements of B are set to 1, 2, 3, 4 and 5 respectively, and the user satisfaction m is defined as:

m = ( Σ_{p=1}^{5} p·b_p ) / ( Σ_{p=1}^{5} b_p )    (13)

From the definition of m, the smaller m is, the higher the user satisfaction.
Fourth step: weight and combine the user satisfaction evaluation index and the frequency adjustment error index into a comprehensive evaluation index, which is set as the reward function of the agent model. The specific steps are as follows:
41) To quantify the frequency regulation level of the power system, define the root-mean-square frequency adjustment error index E_RMS as:

E_RMS = sqrt( (1/N_s)·Σ_{Δk=1}^{N_s} e(Δk)² ) / ( P_max − P_min )    (14)

Wherein: N_s is the number of control periods Δk; e(Δk) is the error signal within the control period Δk; P_min and P_max are respectively the minimum and maximum values of the tracking power signal. From the definition of E_RMS, the smaller E_RMS is, the higher the tracking accuracy of the system.
42) To evaluate the control effect comprehensively and provide a basis for optimizing the power distribution signal, define the comprehensive evaluation index J as:

J = (1 − λ)·E_RMS + λ·m    (15)

Wherein: λ is the weight of user satisfaction.
To guarantee the stability of the grid frequency with priority, user satisfaction is considered only when the tracking accuracy is within a certain range; when the frequency deviation exceeds the specified range, user satisfaction is no longer considered and the temperature control loads are dispatched to the greatest extent to participate in frequency regulation. The relationship between λ and E_RMS is:

λ = { G_1, E_RMS ≤ F_1;  G_2, F_1 < E_RMS ≤ F_2;  G_3, F_2 < E_RMS ≤ F_3;  0, E_RMS > F_3 }    (16)

Wherein: F_1, F_2, F_3, G_1, G_2 and G_3 are all constants, set to 2%, 3%, 5%, 0.8, 0.5 and 0.3 respectively.
Fifth step: establish the deep reinforcement learning agent model based on the soft actor-critic algorithm, construct the agent action space and state space according to the frequency variation of the power system and the operating-state environment information of the demand-side temperature control loads, and construct the reward function of the agent model according to the comprehensive evaluation index obtained in step 4). The temperature control load control framework based on deep reinforcement learning is shown in FIG. 3. The specific operations are as follows:
51) Establish the state space of the deep reinforcement learning agent, which reflects the complete and real physical state of the whole system: the switching state, rated power, indoor and outdoor temperatures, temperature set point and control mode of each temperature control load in the cluster controlled by the agent, the user satisfaction, and the frequency deviation of the power system.
52) Establish the action space of the deep reinforcement learning agent, whose variables correspond to the control variables of the whole system: the switching command and the temperature set point of each temperature control load in the cluster controlled by the agent. The constraints of the action space include the limit on frequent switching and the allowed set temperature range of the temperature control loads.
53) Establish the reward mechanism of the deep reinforcement learning agent, composed of the system frequency deviation and the user satisfaction: the frequency deviation is represented by the root-mean-square error index E_RMS and the user satisfaction by the index m. Since the reinforcement learning agent maximizes the cumulative return, the reward function is set to the negative of the weighted combination of the frequency deviation index and the user satisfaction.
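The following sketch (illustrative, not from the patent) shows one way the state vector, action constraints and reward of steps 51) to 53) could be assembled; the field layout and the switching-lockout time are assumptions.

```python
import numpy as np

def build_state(switch, P_rated, T_in, T_out, T_set, mode, m, delta_f):
    """Step 51): flatten the cluster's physical state plus the satisfaction m
    and the grid frequency deviation into one observation vector
    (the layout is illustrative)."""
    return np.concatenate([switch, P_rated, T_in, [T_out], T_set, mode,
                           [m, delta_f]])

def clip_action(a, T_min, T_max, last_switch_time, t, min_off_on=300.0):
    """Step 52): enforce the set-point range and frequent-switching constraints.
    a = (switch_cmd, T_set_cmd); min_off_on is an assumed lockout in seconds."""
    switch_cmd, T_set_cmd = a
    T_set_cmd = np.clip(T_set_cmd, T_min, T_max)
    locked = (t - last_switch_time) < min_off_on     # too soon to re-switch
    switch_cmd = np.where(locked, -1.0, switch_cmd)  # -1 = hold current state
    return switch_cmd, T_set_cmd

def reward(E_rms, m, lam):
    """Step 53): negative weighted combination of tracking error and satisfaction."""
    return -((1.0 - lam) * E_rms + lam * m)
```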
Sixth step: train the agent model with the soft actor-critic algorithm to solve for its optimal policy. The training process comprises construction of the agent objective function, agent policy iteration and policy updating, and agent parameter updating. The objective function of the agent is constructed from the action space, the state space and the reward function combined with the policy entropy, and the agent maximizes the objective function by continuously optimizing its policy; in this process the agent performs policy iteration and then updates the policy by minimizing the divergence between the new and old policies. The neural networks of the Q-value network and the policy network in the agent model are constructed, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model converges and the optimal policy of the agent model is obtained.
The specific steps of constructing the objective function of the agent are as follows:
61) Construct the objective function containing the entropy regularization term, i.e.

J(π) = Σ_q E_{(s_q,a_q)~p_π}[ r(s_q, a_q) + α·H(π(·|s_q)) ]    (17)

Wherein: E(·) is the expectation function; π is the policy; s_q is the state space of the q-th agent; a_q is the action space of the q-th temperature control load; r(s_q, a_q) is the reward function of the q-th agent; (s_q, a_q)~p_π is the state-action trajectory generated by policy π; α is the temperature term, which determines the influence of the entropy on the reward; H(π(·|s_q)) is the entropy term of the policy in state s_q.
62) To prevent greedy sampling from falling into a local optimum during agent learning, the policy entropy term is calculated as:

H(π(·|s_q)) = −E_{a_q~π}[ log π(a_q|s_q) ]    (18)
The temperature control load agent performs iterative calculation during training; the specific construction of the iteration strategy is as follows:
71) The cost function is composed of the reward function and the expected value over the next state s_{q+1}, i.e. the expectation of the reward function and the new value function; the cost function is used for policy value evaluation, and the Bellman operator is used for policy updating, calculated as:

Q(s_q, a_q) = r(s_q, a_q) + γ·E_{s_{q+1}}[ V(s_{q+1}) ]    (19)

T^π Q(s_q, a_q) = r(s_q, a_q) + γ·E_{s_{q+1}}[ V(s_{q+1}) ]    (20)

Wherein: E_{s_{q+1}}(·) is the expectation over the next state s_{q+1}; T^π is the Bellman backup operator under policy π; γ is the discount factor of the reward; V(s_{q+1}) is the new value function of state s_{q+1}, calculated as:

V(s_{q+1}) = E_{a_{q+1}~π}[ Q(s_{q+1}, a_{q+1}) − α·log π(a_{q+1}|s_{q+1}) ]    (21)
The cost function is continuously updated through the policy:

Q_{k+1} = T^π Q_k    (22)

Wherein: Q_k is the cost function at the k-th calculation.
Iterating the Bellman backup operator and the above step continuously yields:

lim_{k→∞} Q_k = Q_soft^π    (23)

Wherein: Q_soft^π is the soft Q value.
72) In the agent policy improvement process, to make the policy tend toward the exponential form of the Q-value function, the policy update takes the form of minimizing the KL divergence; the SAC policy update method is:

π_new = arg min_{π′∈Π} D_KL( π′(·|s_q) ‖ exp( Q^{π_old}(s_q, ·) ) / Z^{π_old}(s_q) )    (24)

Wherein: D_KL is the KL divergence; Π is the policy set; Q^{π_old} is the cost function under the old policy π_old; Z^{π_old}(s_q) is the distribution function under the old policy π_old, used to normalize the distribution.
Establishing the soft actor-critic deep reinforcement learning agent model requires constructing the SAC algorithm, specifically:
To improve the adaptive learning and generalization capability of the agent model, neural networks comprising a Q-value network and a policy network are constructed;
The Q-value network outputs a single value through the neural network, and the policy network outputs a Gaussian distribution. The parameter learning of the Q-value network is realized by minimizing the residual J_Q(θ); the Q-value network and the policy network perform updates as follows:

J_Q(θ) = E_{(s_q,a_q)~D}[ ½·( Q_θ(s_q, a_q) − ( r(s_q, a_q) + γ·E_{s_{q+1}}[ V_θ(s_{q+1}) ] ) )² ]    (25)

J_π(φ) = E_{s_q~D}[ D_KL( π_φ(·|s_q) ‖ exp( Q_θ(s_q, ·) ) / Z_θ(s_q) ) ]    (26)

Wherein: θ is the Q-value network parameter; φ is the policy network parameter; V_θ and Q_θ are respectively the new value function and the cost function after substituting the Q-value network parameters; Z_θ(s_q) is the distribution function in state s_q; D is the experience pool; α is the temperature parameter.
The temperature parameter is updated adaptively during training, with the update rule:

J(α) = E_{a_q~π_q}[ −α·log π_q(a_q|s_q) − α·H_0 ]    (27)

Wherein: π_q is the control strategy of the q-th agent; H_0 is the target entropy.
Through deep neural network learning, the Q-value network parameters, the policy network parameters and the temperature parameter are continuously updated, so that the model converges and the optimal policy is solved.
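The three update rules (25) to (27), as reconstructed above, correspond to the losses used in practical SAC implementations. The following PyTorch-style sketch (an assumption, not the patent's implementation) shows one gradient step; q_net, q_target and policy are assumed network objects with the indicated interfaces, and the policy loss is the reparameterized practical form of the KL objective (26).

```python
import torch
import torch.nn.functional as F

def sac_losses(q_net, q_target, policy, batch, alpha, log_alpha,
               gamma=0.99, H0=-1.0):
    """One step's losses per equations (25)-(27): Q-value residual,
    reparameterized policy loss, and temperature loss. `q_net`, `q_target`
    and `policy` are assumed objects; batch layout is illustrative."""
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)      # action + log-probability
        # Soft value of the next state, equation (21)
        v_next = q_target(s_next, a_next) - alpha * logp_next
        q_hat = r + gamma * v_next                     # Bellman target, eqs. (19)-(20)
    q_loss = 0.5 * F.mse_loss(q_net(s, a), q_hat)      # equation (25)

    a_new, logp_new = policy.sample(s)                 # reparameterized sample
    # Practical form of the KL objective (26): E[alpha*log pi - Q]
    policy_loss = (alpha * logp_new - q_net(s, a_new)).mean()

    # Equation (27): adapt the temperature toward the target entropy H0
    # (alpha = log_alpha.exp() in the optimizer step)
    alpha_loss = -(log_alpha * (logp_new.detach() + H0)).mean()
    return q_loss, policy_loss, alpha_loss
```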
Seventh step: apply the trained agent model online to an actual temperature control load cluster. The real-time operating state information, user satisfaction information and grid frequency information of the cluster are input into the agent control model, the trained agent calculates the control instruction of the cluster at the current moment, and the cluster adjusts its load according to the control instruction.
Considering the influence of user satisfaction on the participation of temperature control loads in frequency regulation, and exploiting the offline-training and online-execution advantages of deep reinforcement learning, the invention provides a temperature control load frequency response control method based on user satisfaction and deep reinforcement learning. First, according to the operating characteristics of the two kinds of temperature control loads, direct switch control and temperature-setting control, an energy storage index and a discomfort index are respectively defined to represent the factors influencing user satisfaction. Second, a user satisfaction evaluation system is established with the fuzzy comprehensive evaluation method, according to the defined index factors, to quantify the user satisfaction of the temperature control loads. Then, a deep reinforcement learning agent model based on the SAC algorithm is established; the agent model has good self-learning capability under the random uncertainty of large-scale temperature control loads and completes its training adaptively, according to the set policy and parameter update rules, by interacting with environment states such as the operating states of the large-scale temperature control loads and the system deviation. Applying the trained agent model to a temperature control load cluster in actual operation realizes cooperative control of large-scale temperature control load frequency response while accounting for the user satisfaction of the temperature control loads, which has good practical engineering value.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning, comprising the steps of:
1) Establishing a temperature control load model and a power system frequency response model with the temperature control load participating in frequency modulation by adopting a first-order ordinary differential equation;
2) Aiming at temperature control loads adopting two control modes of direct switch control and temperature setting control, respectively establishing user satisfaction degree adjustment indexes of the temperature control loads under the two control modes;
3) According to the user satisfaction degree adjustment index established in the step 2), comprehensive evaluation of the user satisfaction degree is carried out by utilizing a fuzzy comprehensive evaluation method to obtain the user satisfaction degree;
4) Defining a frequency adjustment error index according to the frequency error signal and the tracking power signal of the power system in the control period, and carrying out weighted combination on the user satisfaction degree obtained in the step 3) and the frequency adjustment error index to obtain a comprehensive evaluation index;
5) Establishing a deep reinforcement learning agent model based on the soft actor-critic algorithm, constructing the agent action space and agent state space according to the frequency change of the power system and the operating-state environment information of the demand-side temperature-controlled loads, and constructing the reward function of the agent model according to the comprehensive evaluation index obtained in step 4);
6) Training the agent model with the soft actor-critic algorithm to solve for its optimal policy, wherein the training process comprises construction of the agent objective function, agent policy iteration and policy updating, and agent parameter updating; the objective function of the agent is constructed from the action space, state space and reward function of step 5) combined with the policy entropy, and the agent maximizes the objective function by continuously optimizing its policy; in this process the agent performs policy iteration with the Bellman operator and then updates the policy by minimizing the divergence between the new and old policies; the neural networks of the Q-value network and the policy network in the agent model are constructed, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model converges and the optimal policy of the agent model is obtained;
7) And (3) online applying the trained intelligent agent model in the step (6) in an actual temperature control load cluster, inputting real-time running state information, user satisfaction information and power grid frequency information of the temperature control load cluster into the intelligent agent control model, and calculating a control instruction of the temperature control load cluster at the current moment by the trained intelligent agent, wherein the temperature control load cluster carries out load adjustment according to the control instruction.
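For orientation only, and not as part of any claim, the following minimal Python sketch shows one way the online loop of step 7) could be wired up. The `TrainedAgent` class, the layout of the state vector, and all numeric values are illustrative assumptions rather than elements of the disclosed method.

```python
# Illustrative online-execution loop for step 7); the agent and cluster
# interfaces are assumptions, not part of the claims.
import numpy as np

class TrainedAgent:
    """Stand-in for the SAC policy network trained offline in step 6)."""
    def act(self, state: np.ndarray) -> np.ndarray:
        # A real agent would evaluate its policy network here; this
        # placeholder simply returns a zero adjustment command.
        return np.zeros(2)

def online_control_step(agent, cluster_state, user_satisfaction, grid_freq_dev):
    """One control interval: assemble the state, query the agent,
    and return the control instruction for the TCL cluster."""
    state = np.concatenate([cluster_state,
                            [user_satisfaction, grid_freq_dev]])
    # action = [switch command, temperature setpoint adjustment]
    return agent.act(state)

agent = TrainedAgent()
cluster_state = np.array([1.0, 3.5, 26.0, 33.0, 24.0])  # on/off, kW, T_in, T_out, setpoint
command = online_control_step(agent, cluster_state,
                              user_satisfaction=0.2, grid_freq_dev=-0.05)
print("control instruction:", command)
```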
2. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein establishing the temperature control load model by first-order ordinary differential equations in step 1) comprises the following specific steps:
11) Establishing a first-order ordinary differential equation model that introduces a state variable and a virtual variable to represent the dynamic characteristics of any temperature control load;
12) Calculating the sum of the rated powers of the temperature control load cluster from the dynamic characteristic equations of the individual temperature control loads.
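By way of non-limiting illustration, a minimal Python sketch of steps 11) and 12) follows for a cooling-type cluster. The equivalent-thermal-parameter form of the first-order ODE, the parameter ranges, the coefficient of performance, and the hysteresis deadband rule are assumptions chosen for the example, since the claim does not fix them.

```python
# Minimal sketch of steps 11)-12): a first-order ODE model for each cooling
# TCL (state variable T, virtual on/off variable s), plus the cluster's
# aggregate power. All parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 1000                        # number of TCLs in the cluster
R = rng.uniform(1.5, 2.5, N)    # thermal resistance [degC/kW]
C = rng.uniform(1.5, 2.5, N)    # thermal capacitance [kWh/degC]
P = rng.uniform(4.0, 7.2, N)    # rated electrical power [kW]
eta = 2.5                       # coefficient of performance
T_set, db = 24.0, 0.5           # setpoint and deadband half-width [degC]
T_out = 32.0                    # outdoor temperature [degC]
dt = 1.0 / 60.0                 # time step [h]

T = rng.uniform(T_set - db, T_set + db, N)   # indoor temperatures (state variable)
s = rng.integers(0, 2, N).astype(float)      # on/off state (virtual variable)

for _ in range(120):  # simulate two hours
    # dT/dt = -(1/(R*C)) * (T - T_out + s*eta*R*P), forward-Euler step
    T += dt * (-(T - T_out + s * eta * R * P) / (R * C))
    # hysteresis switching keeps T inside the deadband
    s = np.where(T > T_set + db, 1.0, np.where(T < T_set - db, 0.0, s))

P_agg = float(np.sum(s * P))    # step 12): aggregate power drawn by on-units
print(f"aggregate cluster power: {P_agg:.1f} kW")
```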
3. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein step 2) establishes the user satisfaction adjustment indexes of the temperature control load under the two control modes by the following specific steps:
21) For a directly switch-controlled temperature control load cluster, the influence of the temperature set value on user comfort is neglected because the control mode acts directly on the device switch, and an energy storage index C_s is defined; the agent issues control instructions that keep C_s as close to 0 as possible, reducing the start-stop frequency of the devices;
22) For a temperature-setting-controlled temperature control load cluster, a discomfort index C_u is defined; the agent issues control instructions that keep C_u as close to 0 as possible, reducing user discomfort.
4. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein step 3) performs the comprehensive evaluation of user satisfaction with the fuzzy comprehensive evaluation method, based on the user satisfaction adjustment indexes established in step 2), by the following specific steps:
31) Constructing a user satisfaction factor set comprising the energy storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u};
32) Constructing a user satisfaction comment set with five comment grades set according to the degree of user satisfaction, namely V = {satisfied, fairly satisfied, neutral, fairly dissatisfied, dissatisfied};
33) Determining the weight of each influencing factor: since the factor set consists of the energy storage index C_s and the discomfort index C_u, and the two are taken to be equally important to users, the weight vector is set to [0.5, 0.5];
34) Establishing the fuzzy judgment matrix, which judges the degree to which each factor belongs to each comment grade, with a Gaussian function selected as the membership function;
35) Performing the fuzzy comprehensive judgment to evaluate user satisfaction, where by definition a smaller user satisfaction index m corresponds to higher user satisfaction.
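A minimal Python sketch of steps 31)-35) follows, purely for illustration. The grade centers, the Gaussian width, the defuzzification by grade scores, and the sample values of C_s and C_u are assumptions; the claim fixes only the two-factor set, the five-grade comment set, the equal weights, and the Gaussian membership function.

```python
# Sketch of steps 31)-35): fuzzy comprehensive evaluation of user
# satisfaction from C_s and C_u. Grade centers, the Gaussian width, and
# the defuzzification scores are illustrative assumptions.
import numpy as np

def gaussian(x, c, sigma=0.15):
    """Gaussian membership of x in the comment grade centered at c."""
    return np.exp(-((x - c) ** 2) / (2 * sigma ** 2))

# five comment grades V = {satisfied, ..., dissatisfied}, encoded here by
# the index value on [0, 1] at which membership peaks
grade_centers = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def evaluate_satisfaction(C_s, C_u, weights=(0.5, 0.5)):
    # fuzzy judgment matrix: membership of each factor in each grade
    R = np.vstack([gaussian(C_s, grade_centers),
                   gaussian(C_u, grade_centers)])
    R /= R.sum(axis=1, keepdims=True)        # normalize each row
    B = np.asarray(weights) @ R              # weighted fuzzy composition
    # defuzzify: smaller m means higher satisfaction (per step 35)
    return float(B @ grade_centers)

m = evaluate_satisfaction(C_s=0.1, C_u=0.3)
print(f"user satisfaction index m = {m:.3f}")
```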
5. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 4) the user satisfaction obtained in step 3) and the frequency adjustment error index are weighted and combined into the comprehensive evaluation index by the following specific steps:
41) Evaluating the tracking performance of the system by defining the frequency adjustment error index E_RMS, where a smaller E_RMS corresponds to higher tracking accuracy of the system;
42) Weighting and combining the frequency adjustment error index E_RMS with the user satisfaction m to define the comprehensive evaluation index J.
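The following short Python sketch illustrates steps 41) and 42). The weighting coefficients and the example signals are assumptions, as the claim does not fix them.

```python
# Sketch of steps 41)-42): RMS frequency-adjustment error over a control
# period and the weighted composite index J. Weights are assumptions.
import numpy as np

def frequency_adjustment_error(freq_error, power_tracking_error):
    """E_RMS from the system frequency error and power tracking error
    signals over one control period."""
    e = np.concatenate([freq_error, power_tracking_error])
    return float(np.sqrt(np.mean(e ** 2)))

def composite_index(E_RMS, m, w_f=0.7, w_m=0.3):
    """J = w_f * E_RMS + w_m * m; smaller is better for both terms."""
    return w_f * E_RMS + w_m * m

freq_err = np.array([0.02, -0.01, 0.03])          # frequency error [Hz]
track_err = np.array([0.5, -0.2, 0.1]) / 100.0    # normalized tracking error
J = composite_index(frequency_adjustment_error(freq_err, track_err), m=0.2)
print(f"composite evaluation index J = {J:.4f}")
```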
6. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein the deep reinforcement learning agent model based on the soft actor-critic algorithm is established in step 5) by the following specific steps:
51) Establishing the input information of the agent model, i.e. the state space of the agent: the switching state, rated power, and indoor and outdoor temperatures of the temperature control load cluster controlled by the agent, the temperature set value of the temperature control load, the frequency deviation of the power system, and the user satisfaction m calculated in step 3) together form the state space, which is input into the agent model to realize the agent's perception of its environment;
52) Establishing the output control instruction of the agent model, i.e. the action space of the agent: according to the two control modes, direct switch control and temperature setting control, the control instructions of the temperature control load are set as a load switch instruction and a temperature set value, and the constraint conditions on the control instructions are set as a limit on frequent switching and a limit on the settable temperature range of the temperature control load;
53) Establishing the optimization target of the agent model, i.e. the reward function required by the agent model, according to the comprehensive evaluation index established in step 4): the reward function is set as the negative of the comprehensive evaluation index J formed by the weighted combination of the user satisfaction and the frequency adjustment error index.
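A non-limiting Python sketch of steps 51)-53) follows. The field layout of the state vector, the setpoint range, and the simple minimum-hold rule standing in for the frequent-switching limit are assumptions chosen for the example.

```python
# Sketch of steps 51)-53): assembling the agent's state vector, constraining
# its action, and forming the reward. Ranges and field layout are assumptions.
import numpy as np

T_SET_MIN, T_SET_MAX = 22.0, 27.0   # allowed setpoint range [degC]

def build_state(switch_state, rated_power, T_in, T_out, T_set, freq_dev, m):
    """Step 51): state = cluster operating state + grid frequency
    deviation + user satisfaction m."""
    return np.array([switch_state, rated_power, T_in, T_out, T_set,
                     freq_dev, m])

def constrain_action(switch_cmd, T_set_cmd, last_switch_cmd):
    """Step 52): clip the setpoint command to its range and suppress a
    switch reversal within the hold interval (crude frequent-switch limit)."""
    T_set_cmd = float(np.clip(T_set_cmd, T_SET_MIN, T_SET_MAX))
    if switch_cmd != last_switch_cmd:
        switch_cmd = last_switch_cmd
    return switch_cmd, T_set_cmd

def reward(J):
    """Step 53): reward is the negative composite evaluation index."""
    return -J

s = build_state(1, 3.5, 25.1, 33.0, 24.0, freq_dev=-0.04, m=0.2)
a = constrain_action(switch_cmd=0, T_set_cmd=28.4, last_switch_cmd=1)
print("state:", s, "| constrained action:", a, "| reward:", reward(0.05))
```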
7. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein the objective function of the soft actor-critic algorithm in step 6) maximizes the policy entropy while maximizing the cumulative reward, and the objective function of the agent is constructed by the following specific steps:
61) Constructing the objective function containing the entropy regularization term, i.e.

$$J(\pi)=\sum_{q}\mathbb{E}_{(s_q,a_q)\sim p_\pi}\big[r(s_q,a_q)+\alpha\,\mathcal{H}\big(\pi(\cdot\mid s_q)\big)\big]$$

wherein: $\mathbb{E}(\cdot)$ is the expectation; $\pi$ is the policy; $s_q$ is the state space of the q-th agent; $a_q$ is the action space of the q-th temperature control load; $r(s_q,a_q)$ is the reward function of the q-th agent; $(s_q,a_q)\sim p_\pi$ is the state-action trajectory generated by policy $\pi$; $\alpha$ is the temperature term, which determines how strongly the entropy influences the reward; and $\mathcal{H}(\pi(\cdot\mid s_q))$ is the entropy term of the policy in state $s_q$;
62) Setting the entropy term of the policy, calculated as

$$\mathcal{H}\big(\pi(\cdot\mid s_q)\big)=-\mathbb{E}_{a_q\sim\pi}\big[\log\pi(a_q\mid s_q)\big]$$
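To make the objective of steps 61)-62) concrete, the following Python sketch estimates it by Monte Carlo for a one-dimensional Gaussian policy and a toy reward. The policy parameters, the temperature value, and the reward are assumptions; the closed-form Gaussian entropy is included purely as a cross-check.

```python
# Sketch of steps 61)-62): Monte-Carlo estimate of the entropy-regularized
# objective E[r(s,a) + alpha*H(pi(.|s))] for a Gaussian policy.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.2                        # temperature term
mu, sigma = 0.0, 0.5               # Gaussian policy pi(a|s) = N(mu, sigma^2)

a = rng.normal(mu, sigma, 10_000)  # actions sampled from pi
log_pi = -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
r = -a ** 2                        # toy reward r(s, a)

# H(pi(.|s)) = -E[log pi(a|s)]; Gaussian closed form: 0.5*log(2*pi*e*sigma^2)
H_mc = -log_pi.mean()
H_exact = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

J = r.mean() + alpha * H_mc        # entropy-regularized objective
print(f"H_mc={H_mc:.4f}  H_exact={H_exact:.4f}  J={J:.4f}")
```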
8. The temperature control load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 6) the agent performs strategy iteration using the Bellman operator, constructed as follows:
71) The cost function is composed of the reward function and the expectation over the next state $s_{q+1}$; the Bellman operator for the updated policy contains the expectation of the reward function and the new value function, calculated as

$$T^{\pi}Q(s_q,a_q)=r(s_q,a_q)+\gamma\,\mathbb{E}_{s_{q+1}\sim p}\big[V(s_{q+1})\big]$$

wherein: $\mathbb{E}_{s_{q+1}\sim p}(\cdot)$ is the expectation over the next state $s_{q+1}$; $T^{\pi}$ is the Bellman backup operator under policy $\pi$; $\gamma$ is the discount factor of the reward; and $V(s_{q+1})$ is the new value function of state $s_{q+1}$:

$$V(s_{q+1})=\mathbb{E}_{a_{q+1}\sim\pi}\big[Q(s_{q+1},a_{q+1})-\alpha\log\pi(a_{q+1}\mid s_{q+1})\big]$$

The cost function is continuously realized through strategy updating:

$$Q_{k+1}=T^{\pi}Q_{k}$$

wherein $Q_k$ is the cost function at the k-th iteration; iterating the Bellman backup operator and the above steps continuously yields

$$Q_{k}\to Q^{\pi}_{\mathrm{soft}}\quad(k\to\infty)$$

wherein $Q^{\pi}_{\mathrm{soft}}$ is the soft Q value;
72) The policy update takes the form of minimizing the KL divergence, i.e. the SAC policy update is

$$\pi_{\mathrm{new}}=\arg\min_{\pi'\in\Pi} D_{\mathrm{KL}}\!\left(\pi'(\cdot\mid s_q)\,\Big\|\,\frac{\exp\!\big(Q^{\pi_{\mathrm{old}}}(s_q,\cdot)/\alpha\big)}{Z^{\pi_{\mathrm{old}}}(s_q)}\right)$$

wherein: $D_{\mathrm{KL}}$ is the KL divergence; $\Pi$ is the policy set; $Q^{\pi_{\mathrm{old}}}$ is the cost function under the old policy $\pi_{\mathrm{old}}$; and $Z^{\pi_{\mathrm{old}}}(s_q)$ is the partition function under the old policy $\pi_{\mathrm{old}}$, used to normalize the distribution.
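As a worked illustration of claim 8, the following Python sketch runs soft policy iteration on a toy two-state, two-action MDP: each pass applies the soft Bellman backup and then the KL-minimizing policy update, whose solution is the softmax policy proportional to exp(Q/α). The MDP, α, and γ are assumptions chosen only to make the iteration visible.

```python
# Toy demonstration of claim 8: soft Bellman backups Q <- T^pi Q alternated
# with the KL-minimizing update pi proportional to exp(Q/alpha).
import numpy as np

alpha, gamma = 0.5, 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # reward r[s, a]
P = np.zeros((2, 2, 2))                       # transitions P[s, a, s']
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0

Q = np.zeros((2, 2))
for _ in range(200):
    # policy update: the minimizer of the KL divergence to exp(Q/alpha)/Z
    # is exactly the softmax policy
    pi = np.exp(Q / alpha)
    pi /= pi.sum(axis=1, keepdims=True)
    # soft value: V(s) = E_pi[Q(s,a) - alpha * log pi(a|s)]
    V = (pi * (Q - alpha * np.log(pi))).sum(axis=1)
    # soft Bellman backup: T^pi Q = r + gamma * E_{s'}[V(s')]
    Q = r + gamma * P @ V

print("soft-optimal Q:\n", Q, "\npolicy:\n", pi)
```

Alternating the two steps converges because each backup is a contraction and each softmax update can only improve the entropy-regularized objective.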
9. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 6) the Q-value network outputs a single value through a neural network, and the Q-value network parameters are updated according to

$$J_Q(\theta)=\mathbb{E}_{(s_q,a_q)\sim\mathcal{D}}\left[\tfrac{1}{2}\Big(Q_\theta(s_q,a_q)-\big(r(s_q,a_q)+\gamma\,\mathbb{E}_{s_{q+1}\sim p}\big[V_{\bar\theta}(s_{q+1})\big]\big)\Big)^{2}\right]$$

wherein: $\theta$ is the Q-value network parameter; $\phi$ is the policy network parameter; and $V_{\bar\theta}$ and $Q_\theta$ are respectively the new value function and the cost function with the Q-value network parameters substituted in;
the policy network output is a Gaussian distribution, and the policy network is updated according to

$$J_\pi(\phi)=\mathbb{E}_{s_q\sim\mathcal{D}}\left[D_{\mathrm{KL}}\!\left(\pi_\phi(\cdot\mid s_q)\,\Big\|\,\frac{\exp\!\big(Q_\theta(s_q,\cdot)\big)}{Z_\theta(s_q)}\right)\right]$$

wherein $Z_\theta(s_q)$ is the partition function in state $s_q$;
the temperature parameter is updated so as to iteratively explore all feasible actions, with the update rule

$$J(\alpha)=\mathbb{E}_{a_q\sim\pi_q}\big[-\alpha\log\pi_q(a_q\mid s_q)-\alpha H_0\big]$$

wherein: $\pi_q$ is the control strategy of the q-th agent and $H_0$ is the target entropy term;
the deep neural network learning continuously updates the Q-value network parameters, the policy network parameters and the temperature parameter, so that the model continuously converges and the optimal strategy of the agent model is solved.
CN202310367857.6A 2023-04-07 2023-04-07 Temperature control load frequency response control method based on user satisfaction and reinforcement learning Pending CN116436033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310367857.6A CN116436033A (en) 2023-04-07 2023-04-07 Temperature control load frequency response control method based on user satisfaction and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310367857.6A CN116436033A (en) 2023-04-07 2023-04-07 Temperature control load frequency response control method based on user satisfaction and reinforcement learning

Publications (1)

Publication Number Publication Date
CN116436033A 2023-07-14

Family

ID=87092078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310367857.6A Pending CN116436033A (en) 2023-04-07 2023-04-07 Temperature control load frequency response control method based on user satisfaction and reinforcement learning

Country Status (1)

Country Link
CN (1) CN116436033A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116739323A (en) * 2023-08-16 2023-09-12 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN117291363A (en) * 2023-09-08 2023-12-26 国网山东省电力公司营销服务中心(计量中心) Load regulation and control method and system based on heterogeneous temperature control load model

Similar Documents

Publication Publication Date Title
CN109270842B (en) Bayesian network-based regional heat supply model prediction control system and method
CN112232980B (en) Regulation and control method for heat pump unit of regional energy heat supply system
CN116436033A (en) Temperature control load frequency response control method based on user satisfaction and reinforcement learning
CN110458443A (en) A kind of wisdom home energy management method and system based on deeply study
CN103591637A (en) Centralized heating secondary network operation adjustment method
CN111649457B (en) Dynamic predictive machine learning type air conditioner energy-saving control method
CN114512994A (en) Frequency modulation method, system, equipment and medium for cluster temperature control load system
CN112012875B (en) Optimization method of PID control parameters of water turbine regulating system
Georgiou et al. Implementing artificial neural networks in energy building applications—A review
CN111473408A (en) Control method of heat supply control system based on climate compensation
Sun et al. Integrated control strategy of district heating system based on load forecasting and indoor temperature measurement
CN111555291A (en) Load cluster control method based on adaptive particle swarm
CN115795992A (en) Park energy Internet online scheduling method based on virtual deduction of operation situation
CN117057491B (en) Rural area power supply optimization management method based on combination of MPC and energy storage system
CN113222231A (en) Wisdom heating system based on internet of things
CN113110056A (en) Heat supply intelligent decision-making method and intelligent decision-making machine based on artificial intelligence
CN108490791A (en) Temperature control load Cost Controlling Policy
CN115115145B (en) Demand response scheduling method and system for distributed photovoltaic intelligent residence
CN116963461A (en) Energy saving method and device for machine room air conditioner
CN113869742B (en) Comprehensive supply and demand side power dispatching system based on mobile home and commentator networks
CN115169839A (en) Heating load scheduling method based on data-physics-knowledge combined drive
CN113078629B (en) Aggregate power distribution model for cluster temperature control load aggregate power regulation and control and distributed consistency control method
Ma et al. Control and Communication for Demand Response with Thermostatically Controlled Loads
Xi et al. Q-learning algorithm based multi-agent coordinated control method for microgrids
CN115705608A (en) Virtual power plant load sensing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination