Disclosure of Invention
The technical problem to be solved by the invention is to provide a microgrid optimization scheduling method based on improved Q learning penalty selection. A reward-punishment stepped wind and light abandoning penalty return function is introduced into a traditional microgrid scheduling method in which a conventional unit, a wind-light unit and an energy storage unit run in a coordinated mode, and the states and actions of the microgrid scheduling problem are described through a Q learning algorithm improved by a multi-verse optimization algorithm. In this way, the lowest overall scheduling cost is realized on the basis of satisfying the penalty return function, the abandonment rate of renewable energy is reduced, the volatility of energy interaction between the microgrid and the large power grid is reduced, the problems of slow response and non-convergence of traditional optimization methods are solved, and the stability and economy of microgrid operation are improved.
In order to solve the problems in the prior art, the technical scheme adopted by the invention is as follows:
a microgrid optimization scheduling method based on improved Q learning penalty selection comprises the following steps:
step 1: constructing an objective function according to the running cost of the conventional units inside the microgrid, the environmental benefit cost, and the power interaction cost with the large power grid;
step 2: establishing constraint conditions of micro-grid operation;
step 3: constructing a penalty return function taking the highest wind abandoning cost and the complete wind-light absorption cost as the upper and lower thresholds;
step 4: improving the traditional Q learning algorithm by adopting a multi-verse optimization algorithm;
the state-action function of the optimized improved Q learning algorithm is represented as follows:
in the formula: f
sAs a state feature of traditional Q learning;
the motion characteristics are optimized by a multivariate universe optimization algorithm;
respectively the initial values of the state characteristic and the action characteristic; e
mvo-pThe expected value under the MVO-Q strategy is obtained; t is the iteration number;
Y
Trespectively is a reward value and a discount coefficient under iteration;
step 5: carrying out Markov decision description processing on the objective function obtained in step 1, and carrying out planning solution on the obtained state and action descriptions by using the improved Q learning algorithm.
Wherein, the step 1 comprises the following steps:
step 1.1: under the condition of wind-solar high-proportion grid connection, a conventional unit is divided into a conventional operation state and a low-load operation state, and the conventional power generation cost inside a microgrid is represented as follows:
in the formula: a, b and c are cost coefficients in the normal running state of the conventional unit; Pi is the output power of the ith conventional unit; g, h, l and p are cost coefficients in the low-load running state; KPi,max is the critical power between the normal running state and the low-power running state of the ith conventional unit;
step 1.2: under the condition of uncertain wind and light output, the start-stop cost of the conventional unit is expressed as follows:
in the formula: fon-offThe start-stop cost of the conventional unit is reduced; c is the number of start-stop times of the unit; k (t)i,r) The cost of the ith unit for the starting for the r time; t is ti,rThe continuous shutdown time of the ith unit before C times of starting; c (t)i,r) It is the operating cost of the associated auxiliary system for the unit cold start; t is tcold-hotThe shutdown critical time is the shutdown critical time of the unit in cold-state starting and hot-state starting;
step 1.3: the pollutants discharged by the conventional unit for power generation mainly contain nitrogen oxides, sulfur oxides, carbon dioxide and the like, and the treatment cost is expressed as follows:
Em(Pi) = (αi,m + βi,mPi + γi,mPi²) + ζi,m·exp(δi,mPi)
in the formula: fgThe cost is reduced for the pollution treatment of the conventional unit; m is the type of the discharged pollutant; em(Pi) The discharge amount of pollutants of the ith unit is calculated; etamThe treatment cost coefficient of the m-th pollutants;
αi,m、βi,m、γi,m、ζi,m、δi,mthe discharge coefficient of the mth pollutant discharged by the ith unit;
step 1.4: the power exchange cost of the micro grid and the large grid is expressed as follows:
in the formula: λp is the electricity selling and purchasing state of the microgrid, taking the value 1 when selling and -1 when purchasing; Psu/sh is the surplus or shortage of power inside the microgrid; and the remaining symbols are the prices at which the large power grid sells and purchases electricity;
step 1.5: the method is characterized in that an objective function is constructed according to the running cost, the environmental benefit cost and the power exchange cost of a main power grid of a conventional unit in a microgrid, and is expressed as follows:
minF = Fcf + Fon-off + Fg + Fgrid
in the formula: F is the objective function value of microgrid system operation; Fcf, Fon-off, Fg and Fgrid are, respectively, the conventional unit running cost, the start-stop cost, the pollution treatment cost, and the power interaction cost between the microgrid and the large power grid.
Wherein, the step 2 comprises the following steps:
step 2.1: the power balance constraint is expressed as follows:
in the formula: the first three symbols represent, respectively, the conventional unit output power, the wind power output power and the photovoltaic output power in period t; the next is the storage and release power of the storage battery in period t; Pt,grid is the interaction power with the large power grid; Pt,L is the total load power in period t; and T is the total operating period of the microgrid, taken as 24 h;
step 2.2: the battery storage state constraint is expressed as follows:
SOCmin≤SOC(t)≤SOCmax
in the formula: SOC(t) is the state of charge of the storage battery at time t; SOCmin and SOCmax represent the minimum and maximum states of charge of the battery, respectively;
step 2.3: for a conventional unit, the accumulated start-stop time should be greater than the minimum continuous start-stop time, and the constraint is expressed as follows:
in the formula: the two symbols are, respectively, the minimum continuous shutdown time of the unit and the minimum continuous startup time of the unit.
Wherein, the step 3 comprises the following steps:
step 3.1: the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid are specified, and the growth range χn from complete wind-light consumption to the maximum quota of the abandoned wind and light quantity is divided into the following intervals:
in the formula: the first two symbols are, respectively, the maximum and minimum quotas of the abandoned wind and light quantity specified in the system; n is the number of divided intervals; and λ is the growth step length of the quota increment;
step 3.2: according to the quota intervals specified by the system for the abandoned wind and light quantity, the abandoned wind and light quantity is subjected to linearization processing to obtain the reward-punishment stepped wind and light abandoning penalty return function, expressed as follows:
in the formula: dab is the wind and light abandoning penalty return function value; Pab,wp is the abandoned wind and light quantity of the system; c is the wind and light abandoning penalty coefficient; and k is the interval growth step of the penalty coefficient.
Wherein, the step 5 comprises the following steps:
step 5.1: the objective function in the step 1 comprises unit operation cost, environmental benefit cost and main power grid power exchange cost, and the state description of each main body in the system in the iterative process T is represented as:
Fs=[Fcf,Fon-off,Em(Pi),Fg,Fgrid,F]
step 5.2: the constraint conditions in step 2 comprise the conventional unit output power, the wind power and photovoltaic output power, the storage and release power of the storage battery, the large power grid interaction power and the total load power; taking the reward-punishment principle for the abandoned wind and light quantity into account at the same time, these quantities are discretized to obtain the action description of each main body in the system in iteration process T, expressed as follows:
step 5.3: the steps of solving the optimal value of the objective function by the Q learning algorithm improved by the multi-verse algorithm are as follows:
5.31) defining the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid, dividing the wind and light abandoning penalty intervals, and initializing the parameters of the multi-verse algorithm: the number of universe individuals N, the dimension n, the maximum iteration number MAX and the initial wormhole position Xij;
5.32) randomly selecting the initial state of the Q learning algorithm;
5.33) optimizing the initial action of the Q learning greedy strategy by the multi-verse algorithm;
5.34) outputting the initial state based on the greedy strategy and performing initial optimization preparation;
5.35) solving an optimal value minF of the objective function according to the optimized initial action;
5.36) judging whether the error precision is met;
5.37) if the error precision is satisfied, selecting the action, calculating the optimal value update and the wormhole distance of the multi-verse algorithm, and carrying out the next iteration, wherein the optimal value update formula is as follows:
in the formula: xjThe position of the optimal universe individual is determined; p is a radical of1/p2/p3∈[0,1]Is a random number; epsilon is the rate of cosmic expansion; u. ofj,ljThe upper and lower limits of x; eta is the proportion of wormholes in all individuals, is specified by the iteration number L and the maximum iteration number L, and is expressed as follows:
The optimizing mechanism of the multi-verse algorithm is that white holes and black holes are selected according to a roulette mechanism, individuals move toward the current optimal universe through expansion and wormhole travel, and the optimal moving distance during movement is related to the iteration precision p and is expressed as follows:
5.38) if the error precision is not satisfied, abandoning the action of this iteration, selecting an action again, and returning to step 5.35);
5.39) judging whether the objective function value is a global optimum value, if not, returning to the step 5.38);
5.40) if the value is the global optimum value, outputting the final state and action;
5.41) calculating the final result.
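The wormhole proportion η and the iteration-dependent moving distance used in step 5.37) can be sketched as follows; the formulas follow the standard multi-verse optimizer (wormhole existence probability and travelling distance rate), and the bound values and parameter names are illustrative assumptions rather than the exact quantities of the invention:

```python
import random

def wep(l, L, wep_min=0.2, wep_max=1.0):
    """Wormhole existence probability eta, growing linearly with the
    iteration number l up to the maximum iteration number L."""
    return wep_min + l * (wep_max - wep_min) / L

def tdr(l, L, p=6.0):
    """Travelling distance rate, shrinking with iteration; p plays the
    role of the iteration precision in the patent's description."""
    return 1.0 - (l ** (1.0 / p)) / (L ** (1.0 / p))

def move_universe(x, best, lb, ub, l, L, rng=random):
    """One position update: each dimension may travel through a wormhole
    around the current optimal universe `best`, clamped to [lb, ub]."""
    eta, dist = wep(l, L), tdr(l, L)
    new = []
    for j, xj in enumerate(x):
        if rng.random() < eta:
            step = dist * ((ub[j] - lb[j]) * rng.random() + lb[j])
            val = best[j] + step if rng.random() < 0.5 else best[j] - step
        else:
            val = xj
        new.append(max(lb[j], min(ub[j], val)))  # clamp to search bounds
    return new
```

Early iterations thus explore widely (small η, large distance), while later iterations exploit the neighbourhood of the best universe, which is the behaviour steps 5.37)-5.38) rely on.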
Further, in the step 3.2, the reward punishment step-type wind and light abandonment punishment return function is used as an action value in the improved Q learning method.
Further, in the step 4, a multivariate cosmic optimization algorithm is adopted to improve the optimal value of the state feature corresponding to the objective function in the traditional Q learning algorithm.
Further, the step 4 adopts a multivariate cosmic optimization algorithm to improve the conventional Q learning algorithm, and specifically comprises the following steps:
the multi-universe algorithm is used for optimizing the multi-level greedy action of Q learning, the occurrence of redundant action in optimization is reduced, and the Q iteration result is further reducedmvo-qError accuracy gamma ofT(ii) a And performing next state-action strategy under the condition that the iteration error precision is not satisfied, and performing next optimization processing by adopting a multi-universe algorithm, wherein an optimization formula is expressed as follows:
the invention has the advantages and beneficial effects that:
the method provided by the invention gives consideration to wind-light consumption, environmental benefits and economic benefits, establishes a mathematical model for a target function by considering conventional units, wind-light units, energy storage units, large power grid interaction processes and pollutant treatment inside a microgrid, and introduces a reward and punishment step type wind and light abandoning punishment return function to further plan wind-light power generation grid connection. Meanwhile, a Q learning algorithm improved by a multi-universe algorithm is provided, the state and the action parameters of the traditional Q learning are corresponding to the target function and the constraint condition of the micro-grid dispatching and the light abandoning and punishment of the abandoned wind, and the maximum environmental benefit and the complete wind and light consumption are realized while the stable power supply of the system is met. The improved Q learning algorithm provided by the invention adopts a planning mechanism for optimization, avoids the problem of optimal value local convergence generated in the optimization process of the traditional algorithm, considers a selection mechanism of wind and light abandoning punishment return, and solves the problem of multi-objective optimization in a microgrid scheduling model.
The method reduces the abandonment rate of renewable energy sources in the operation scheduling of the micro-grid, reduces the fluctuation of energy interaction between the micro-grid and the large grid, solves the problems of slow response and non-convergence of the traditional optimization method, and improves the stability and the economy of the operation of the micro-grid.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 4, the method for optimizing and scheduling a microgrid based on improved Q learning penalty selection of the present invention includes the following steps:
step 1: constructing an objective function according to the running cost of the conventional units inside the microgrid, the environmental benefit cost, and the power exchange cost with the main power grid;
step 1.1: under the condition of wind-solar high-proportion grid connection, the conventional unit is divided into a conventional operation state and a low-load operation state, namely the conventional power generation cost inside the microgrid is expressed as follows:
in the formula: fcfThe running cost of the conventional unit is reduced; a. b and c are cost factors in the normal running state of the conventional unit; piOutputting power for the ith conventional unit; g. h, l and p are cost factors in a low-load operation state; kPi,maxThe critical power of the normal operation state and the low-power operation state of the ith conventional unit.
Step 1.2: under the condition of uncertain wind and light output, the start-stop cost of the conventional unit is expressed as follows:
in the formula: Fon-off is the start-stop cost of the conventional unit; C is the number of unit start-stop times; K(ti,r) is the cost of the rth start of the ith unit; ti,r is the continuous shutdown time of the ith unit before the rth start; C(ti,r) is the running cost of the associated auxiliary system during unit cold start; tcold-hot is the shutdown critical time separating cold-state start and hot-state start of the unit.
Step 1.3: the pollutants discharged by the conventional unit for power generation mainly contain nitrogen oxides, sulfur oxides, carbon dioxide and the like, and the treatment cost is expressed as follows:
Em(Pi) = (αi,m + βi,mPi + γi,mPi²) + ζi,m·exp(δi,mPi)
in the formula: fgThe cost is reduced for the pollution treatment of the conventional unit; m is the type of the discharged pollutant; em(Pi) The discharge amount of pollutants of the ith unit is calculated; etamThe treatment cost coefficient of the m-th pollutants; alpha is alphai,m、βi,m、γi,m、ζi,m、δi,mThe discharge coefficient of the mth pollutant discharged by the ith unit;
step 1.4: the power exchange cost of the micro grid and the large grid is expressed as follows:
in the formula: f
gridThe cost is the power interaction cost of the micro-grid and the large grid; lambda [ alpha ]
pThe electricity selling value is 1 and the electricity purchasing value is-1 for the micro-grid electricity selling and purchasing state; p
su/shExcess and shortage of power inside the microgrid;
the price of electricity sold and purchased by a large power grid.
Step 1.5: the method is characterized in that an objective function is constructed according to the running cost, the environmental benefit cost and the power exchange cost of a main power grid of a conventional unit in a microgrid, and is expressed as follows:
minF = Fcf + Fon-off + Fg + Fgrid
in the formula: F is the objective function value of microgrid system operation; Fcf, Fon-off, Fg and Fgrid are, respectively, the conventional unit running cost, the start-stop cost, the pollution treatment cost, and the power interaction cost between the microgrid and the large power grid.
Step 2: establishing constraint conditions of micro-grid operation;
step 2.1: the power balance constraint is expressed as follows:
in the formula: the first three symbols represent, respectively, the conventional unit output power, the wind power output power and the photovoltaic output power in period t; the next is the storage and release power of the storage battery in period t; Pt,grid is the interaction power with the large power grid; Pt,L is the total load power in period t; and T is the total operating period of the microgrid, taken as 24 h.
Step 2.2: the battery storage state constraint is expressed as follows:
SOCmin≤SOC(t)≤SOCmax
in the formula: SOC(t) is the state of charge of the storage battery at time t; SOCmin and SOCmax represent the minimum and maximum states of charge of the battery, respectively.
Step 2.3: for a conventional unit, the accumulated start-stop time should be greater than the minimum continuous start-stop time, and the constraint is expressed as follows:
in the formula: the two symbols are, respectively, the minimum continuous shutdown time of the unit and the minimum continuous startup time of the unit.
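The three constraint families of step 2 can be sketched as simple feasibility checks; the limit values (SOCmin, SOCmax, minimum start/stop times, tolerance) are hypothetical placeholders, not values fixed by the invention:

```python
def power_balance_ok(p_gen, p_wind, p_pv, p_batt, p_grid, p_load, tol=1e-6):
    """Step 2.1: in each period t, conventional + wind + PV + battery
    (positive when discharging) + grid exchange must equal the load."""
    return abs(p_gen + p_wind + p_pv + p_batt + p_grid - p_load) <= tol

def soc_ok(soc, soc_min=0.2, soc_max=0.9):
    """Step 2.2: battery state-of-charge constraint
    SOCmin <= SOC(t) <= SOCmax."""
    return soc_min <= soc <= soc_max

def min_run_time_ok(run_hours, stop_hours, t_on_min=2, t_off_min=2):
    """Step 2.3: accumulated continuous startup/shutdown time must be
    at least the unit's minimum continuous startup/shutdown time."""
    return run_hours >= t_on_min and stop_hours >= t_off_min
```

In a scheduling loop, a candidate dispatch action would be rejected (or penalized) whenever any of these checks returns False.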
And step 3: constructing a penalty return function taking the highest wind abandon cost and the wind-light complete absorption cost as the highest and the lowest threshold values;
step 3.1: the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid are specified, and the growth range χn from complete wind-light consumption to the maximum quota of the abandoned wind and light quantity is divided into the following intervals:
in the formula: the first two symbols are, respectively, the maximum and minimum quotas of the abandoned wind and light quantity specified in the system; n is the number of divided intervals; and λ is the growth step length of the quota increment.
Step 3.2: according to the quota intervals specified by the system for the abandoned wind and light quantity, the abandoned wind and light quantity is subjected to linearization processing to obtain the reward-punishment stepped wind and light abandoning penalty return function, expressed as follows:
in the formula: dab is the wind and light abandoning penalty return function value; Pab,wp is the abandoned wind and light quantity of the system; c is the wind and light abandoning penalty coefficient; and k is the interval growth step of the penalty coefficient.
In step 3.2, the reward-punishment stepped wind and light abandoning penalty return function is taken as the action value in the improved Q learning method.
Step 4: improving the traditional Q learning algorithm by adopting the multi-verse optimization algorithm;
the multivariate universe optimization algorithm is used as a heuristic search algorithm, the universe is used as a feasible problem solution, and cyclic iteration is performed through the interaction of the black holes, the white holes and the wormholes, namely, the optimal selection of the traditional Q learning algorithm in an unsupervised state is subjected to iterative optimization, so that an enhanced target solution is obtained. The state-action function of the optimized improved Q learning algorithm is represented as follows:
in the formula: f
sAs the state characteristic of the traditional Q learning, the state characteristic corresponds to a target function F operated by the micro-grid system;
corresponding to the reward punishment step type wind and light abandoning punishment return function value d for the action characteristics optimized by the multi-universe optimization algorithm
ab;
Respectively the initial values of the state characteristic and the action characteristic; e
mvo-pThe expected value under the MVO-Q strategy is obtained; t is the iteration number;
Y
Trespectively, the reward value and discount coefficient under iteration.
The multi-verse algorithm is used for optimizing the multi-level greedy actions of Q learning, reducing the occurrence of redundant actions during optimization and further reducing the error precision γT of the Q iteration result Qmvo-q (the initial error precision is γT0). If the iteration error precision is not satisfied, the next state-action strategy is carried out and the multi-verse algorithm is adopted for the next optimization processing, wherein the optimization formula is expressed as follows:
in the formula: the first two symbols are the action feature and the state feature at time T-1; the next is the state feature at time T; and the last is the reward value at time T-1.
The multi-verse optimization algorithm improves the optimal value of the state feature corresponding to the objective function in the traditional Q learning algorithm.
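A minimal sketch of the tabular value update underlying the improved Q iteration described above; the multi-verse action selection is abstracted away here, and the learning rate and discount coefficient values are illustrative assumptions:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    In the MVO-Q variant of the patent, the next action is proposed by
    the multi-verse optimizer rather than pure epsilon-greedy search,
    but the value update itself keeps this standard form."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    td_target = reward + gamma * best_next
    Q.setdefault(s, {}).setdefault(a, 0.0)
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]
```

Here the state s would encode the cost-term vector Fs of step 5.1 and the action a the discretized dispatch/penalty choice of step 5.2.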
And 5: and (3) carrying out Markov decision description processing on the target function obtained in the step (1), and carrying out planning solution on the obtained state and action description by using an improved Q learning algorithm.
Step 5.1: the objective function in the step 1 comprises unit operation cost, environmental benefit cost and main power grid power exchange cost, so that the state description of each main body in the system in the iterative process T is represented as follows:
Fs=[Fcf,Fon-off,Em(Pi),Fg,Fgrid,F]
step 5.2: the constraint conditions in step 2 comprise the conventional unit output power, the wind power and photovoltaic output power, the storage and release power of the storage battery, the large power grid interaction power and the total load power; taking the reward-punishment principle for the abandoned wind and light quantity into account at the same time, these quantities are discretized to obtain the action description of each main body in the system in iteration process T, expressed as follows:
step 5.3: as shown in fig. 1, the steps of solving the optimal value of the objective function by the Q learning algorithm improved by the multi-verse algorithm are as follows:
5.31) defining the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid, dividing the wind and light abandoning penalty intervals, and initializing the parameters of the multi-verse algorithm: the number of universe individuals N, the dimension n, the maximum iteration number MAX and the initial wormhole position Xij;
5.32) randomly selecting the initial state of the Q learning algorithm;
5.33) optimizing the initial action of the Q learning greedy strategy by the multi-verse algorithm;
5.34) outputting the initial state based on the greedy strategy and performing initial optimization preparation;
5.35) solving an optimal value minF of the objective function according to the optimized initial action;
5.36) judging whether the error precision is met;
5.37) if the error precision is satisfied, selecting the action, calculating the optimal value update and the wormhole distance of the multi-verse algorithm, and carrying out the next iteration, wherein the optimal value update formula is as follows:
in the formula: xjThe position of the optimal universe individual is determined; p is a radical of1/p2/p3∈[0,1]Is a random number; epsilon is the rate of cosmic expansion; u. ofj,ljThe upper and lower limits of x; eta is the proportion of wormholes in all individuals, is specified by the iteration number L and the maximum iteration number L, and is expressed as follows:
the multivariate cosmic algorithm optimizing mechanism is that black holes and swinging are selected according to a roulette mechanism, an individual moves in the current optimal cosmic through expansion and self-turning, and the optimal moving distance in the moving process is related to the iteration precision p and is expressed as follows:
5.38) if the error precision is not satisfied, abandoning the action of this iteration, selecting an action again, and returning to step 5.35);
5.39) judging whether the objective function value is the global optimum value; if not, returning to step 5.38);
5.40) if the value is the global optimum value, outputting the final state and action;
5.41) calculating the final result.
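The overall flow of steps 5.31)-5.41) can be sketched as a toy loop, assuming a generic objective over bounded decision variables; the population size, iteration limit, error precision tol and the wormhole/travelling-distance formulas follow the standard multi-verse optimizer and are illustrative, not the exact quantities of the invention:

```python
import random

def mvo_q_schedule(objective, lb, ub, n_universes=10, max_iter=50, tol=1e-6, seed=1):
    """Initialize universes (candidate dispatch actions), evaluate the
    objective, move universes toward the best one, and stop when the
    improvement between iterations falls below the error precision."""
    rng = random.Random(seed)
    dim = len(lb)
    pop = [[rng.uniform(lb[j], ub[j]) for j in range(dim)]
           for _ in range(n_universes)]
    best = list(min(pop, key=objective))
    best_val = objective(best)
    for l in range(1, max_iter + 1):
        eta = 0.2 + 0.8 * l / max_iter               # wormhole proportion
        dist = 1.0 - (l / max_iter) ** (1.0 / 6.0)   # travelling distance rate
        for x in pop:
            for j in range(dim):
                if rng.random() < eta:               # travel toward the best
                    step = dist * ((ub[j] - lb[j]) * rng.random() + lb[j])
                    x[j] = best[j] + step if rng.random() < 0.5 else best[j] - step
                    x[j] = max(lb[j], min(ub[j], x[j]))
        cand = min(pop, key=objective)
        cand_val = objective(cand)
        if cand_val < best_val:
            improvement = best_val - cand_val
            best, best_val = list(cand), cand_val
            if improvement < tol:                    # error precision reached
                break
    return best, best_val
```

In the patent's setting, `objective` would be the microgrid cost minF evaluated on a feasible dispatch, with infeasible actions rejected by the step-2 constraints and penalized by the stepped curtailment return.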
Carrying out experiment simulation by adopting the classic electric load requirement in the conventional micro-grid, wherein the experiment parameters are set as follows:
the method provided by the invention is used for carrying out optimized dispatching on a typical micro-grid comprising a wind power plant, a photovoltaic power plant, a gas turbine unit and an energy storage unit, supposing that power interaction exists between the micro-grid and a large power grid, and carrying out optimized solving on an objective function by adopting a traditional particle swarm algorithm and the improved Q learning algorithm to obtain a system comprehensive dispatching plan meeting the maximum wind and light consumption. As shown in FIGS. 2 and 3, through comparative analysis of simulation experiments, the total wind and light consumption of the micro-grid dispatching by using the method provided by the invention is improved by 33.18%, and the comprehensive cost is reduced by 6.51%. Therefore, the wind-solar energy consumption ratio can be greatly improved in the scheduling planning process of the micro-grid, and the maximization of the economic benefit is achieved while the environmental benefit is met.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.