Disclosure of Invention
The technical problem to be solved by the invention is to provide a microgrid optimization scheduling method based on improved Q learning penalty selection. A reward-punishment stepped wind and light abandoning penalty return function is introduced into a traditional microgrid scheduling method in which a conventional unit, a wind-light unit and an energy storage unit run in a coordinated mode, and the states and actions of the microgrid scheduling problem are described through a Q learning algorithm improved by a multi-verse optimization algorithm. In this way, the lowest overall scheduling cost is realized on the basis of satisfying the penalty return function, the abandonment rate of renewable energy is reduced, the volatility of energy interaction between the microgrid and the large power grid is reduced, the problems of slow response and non-convergence of traditional optimization methods are solved, and the stability and economy of microgrid operation are improved.
In order to solve the problems in the prior art, the technical scheme adopted by the invention is as follows:
a microgrid optimization scheduling method based on improved Q learning penalty selection comprises the following steps:
step 1: constructing an objective function according to the running cost of the conventional units inside the microgrid, the environmental benefit cost, and the power interaction cost with the large power grid;
step 2: establishing constraint conditions of micro-grid operation;
step 3: constructing a penalty return function taking the highest wind abandoning cost and the complete wind-light absorption cost as the upper and lower thresholds;
step 4: improving the traditional Q learning algorithm by adopting a multi-verse optimization algorithm;
the state-action function of the optimized improved Q learning algorithm is represented as follows:
in the formula: f
sAs a state feature of traditional Q learning;
the motion characteristics are optimized by a multivariate universe optimization algorithm;
respectively the initial values of the state characteristic and the action characteristic; e
mvo-pThe expected value under the MVO-Q strategy is obtained; t is the iteration number;
Y
Trespectively is a reward value and a discount coefficient under iteration;
step 5: carrying out Markov decision description processing on the objective function obtained in step 1, and carrying out planning solution on the obtained state and action descriptions by using the improved Q learning algorithm.
Wherein, the step 1 comprises the following steps:
step 1.1: under the condition of wind-solar high-proportion grid connection, a conventional unit is divided into a conventional operation state and a low-load operation state, and the conventional power generation cost inside a microgrid is represented as follows:
in the formula: a, b and c are cost coefficients in the normal running state of the conventional unit; Pi is the output power of the ith conventional unit; g, h, l and p are cost coefficients in the low-load running state; KPi,max is the critical power between the normal running state and the low-power running state of the ith conventional unit;
step 1.2: under the condition of uncertain wind and light output, the start-stop cost of the conventional unit is expressed as follows:
in the formula: fon-offThe start-stop cost of the conventional unit is reduced; c is the number of start-stop times of the unit; k (t)i,r) The cost of the ith unit for the starting for the r time; t is ti,rThe continuous shutdown time of the ith unit before C times of starting; c (t)i,r) It is the operating cost of the associated auxiliary system for the unit cold start; t is tcold-hotThe shutdown critical time is the shutdown critical time of the unit in cold-state starting and hot-state starting;
step 1.3: the pollutants discharged by the conventional unit for power generation mainly contain nitrogen oxides, sulfur oxides, carbon dioxide and the like, and the treatment cost is expressed as follows:
Em(Pi) = (αi,m + βi,mPi + γi,mPi²) + ζi,m·exp(δi,mPi)
in the formula: fgThe cost is reduced for the pollution treatment of the conventional unit; m is the type of the discharged pollutant; em(Pi) The discharge amount of pollutants of the ith unit is calculated; etamThe treatment cost coefficient of the m-th pollutants;
αi,m、βi,m、γi,m、ζi,m、δi,mthe discharge coefficient of the mth pollutant discharged by the ith unit;
step 1.4: the power exchange cost of the micro grid and the large grid is expressed as follows:
in the formula: λp is the electricity selling and purchasing state of the microgrid, taking the value 1 when selling and -1 when purchasing; Psu/sh is the surplus or shortage of power inside the microgrid; and the remaining symbols are the prices at which the large power grid sells and purchases electricity;
step 1.5: the method is characterized in that an objective function is constructed according to the running cost, the environmental benefit cost and the power exchange cost of a main power grid of a conventional unit in a microgrid, and is expressed as follows:
minF = Fcf + Fon-off + Fg + Fgrid
in the formula: F is the objective function value of microgrid system operation; Fcf, Fon-off, Fg and Fgrid are, respectively, the conventional unit running cost, the start-stop cost, the pollution treatment cost, and the power interaction cost between the microgrid and the large power grid.
Wherein, the step 2 comprises the following steps:
step 2.1: the power balance constraint is expressed as follows:
in the formula: the first three symbols represent, respectively, the conventional unit output power, the wind power output power and the photovoltaic output power in period t; the next is the storage and release power of the storage battery in period t; Pt,grid is the interaction power with the large power grid; Pt,L is the total load power in period t; and T is the total operating period of the microgrid, taken as 24 h;
step 2.2: the battery storage state constraint is expressed as follows:
SOCmin≤SOC(t)≤SOCmax
in the formula: SOC(t) is the state of charge of the storage battery at time t; SOCmin and SOCmax represent the minimum and maximum states of charge of the battery, respectively;
step 2.3: for a conventional unit, the accumulated start-stop time should be greater than the minimum continuous start-stop time, and the constraint is expressed as follows:
in the formula: the two symbols are, respectively, the minimum continuous shutdown time of the unit and the minimum continuous startup time of the unit.
Wherein, the step 3 comprises the following steps:
step 3.1: the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid are specified, and the growth range χn from complete wind-light consumption to the maximum quota of the abandoned wind and light quantity is divided into the following intervals:
in the formula: the first two symbols are, respectively, the maximum and minimum quotas of the abandoned wind and light quantity specified in the system; n is the number of divided intervals; and λ is the growth step length of the quota increment;
step 3.2: according to the quota intervals specified by the system for the abandoned wind and light quantity, the abandoned wind and light quantity is subjected to linearization processing to obtain the reward-punishment stepped wind and light abandoning penalty return function, expressed as follows:
in the formula: dab is the wind and light abandoning penalty return function value; Pab,wp is the abandoned wind and light quantity of the system; c is the wind and light abandoning penalty coefficient; and k is the interval growth step of the penalty coefficient.
Wherein, the step 5 comprises the following steps:
step 5.1: the objective function in the step 1 comprises unit operation cost, environmental benefit cost and main power grid power exchange cost, and the state description of each main body in the system in the iterative process T is represented as:
Fs=[Fcf,Fon-off,Em(Pi),Fg,Fgrid,F]
step 5.2: the constraint conditions in step 2 comprise the conventional unit output power, the wind power and photovoltaic output power, the storage and release power of the storage battery, the large power grid interaction power and the total load power; taking the reward-punishment principle for the abandoned wind and light quantity into account at the same time, these quantities are discretized to obtain the action description of each main body in the system in iteration process T, expressed as follows:
step 5.3: the steps of solving the optimal value of the objective function by the Q learning algorithm improved by the multi-verse algorithm are as follows:
5.31) defining the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid, dividing the wind and light abandoning penalty intervals, and initializing the parameters of the multi-verse algorithm: the number of universe individuals N, the dimension n, the maximum iteration number MAX and the initial wormhole position Xij;
5.32) randomly selecting the initial state of the Q learning algorithm;
5.33) optimizing the initial action of the Q learning greedy strategy by the multi-verse algorithm;
5.34) outputting the initial state based on the greedy strategy and performing initial optimization preparation;
5.35) solving an optimal value minF of the objective function according to the optimized initial action;
5.36) judging whether the error precision is met;
5.37) if the error precision is satisfied, selecting the action, calculating the optimal value update and the wormhole distance of the multi-verse algorithm, and carrying out the next iteration, wherein the optimal value update formula is as follows:
in the formula: xjThe position of the optimal universe individual is determined; p is a radical of1/p2/p3∈[0,1]Is a random number; epsilon is the rate of cosmic expansion; u. ofj,ljThe upper and lower limits of x; eta is the proportion of wormholes in all individuals, is specified by the iteration number L and the maximum iteration number L, and is expressed as follows:
The optimizing mechanism of the multi-verse algorithm is that white holes and black holes are selected according to a roulette mechanism, individuals move toward the current optimal universe through expansion and wormhole travel, and the optimal moving distance during movement is related to the iteration precision p and is expressed as follows:
5.38) if the error precision is not satisfied, abandoning the action of this iteration, selecting an action again, and returning to step 5.35);
5.39) judging whether the objective function value is a global optimum value, if not, returning to the step 5.38);
5.40) if the value is the global optimum value, outputting the final state and action;
5.41) calculating the final result.
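The wormhole proportion η and the iteration-dependent moving distance used in step 5.37) can be sketched as follows; the formulas follow the standard multi-verse optimizer (wormhole existence probability and travelling distance rate), and the bound values and parameter names are illustrative assumptions rather than the exact quantities of the invention:

```python
import random

def wep(l, L, wep_min=0.2, wep_max=1.0):
    """Wormhole existence probability eta, growing linearly with the
    iteration number l up to the maximum iteration number L."""
    return wep_min + l * (wep_max - wep_min) / L

def tdr(l, L, p=6.0):
    """Travelling distance rate, shrinking with iteration; p plays the
    role of the iteration precision in the patent's description."""
    return 1.0 - (l ** (1.0 / p)) / (L ** (1.0 / p))

def move_universe(x, best, lb, ub, l, L, rng=random):
    """One position update: each dimension may travel through a wormhole
    around the current optimal universe `best`, clamped to [lb, ub]."""
    eta, dist = wep(l, L), tdr(l, L)
    new = []
    for j, xj in enumerate(x):
        if rng.random() < eta:
            step = dist * ((ub[j] - lb[j]) * rng.random() + lb[j])
            val = best[j] + step if rng.random() < 0.5 else best[j] - step
        else:
            val = xj
        new.append(max(lb[j], min(ub[j], val)))  # clamp to search bounds
    return new
```

Early iterations thus explore widely (small η, large distance), while later iterations exploit the neighbourhood of the best universe, which is the behaviour steps 5.37)-5.38) rely on.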
Further, in the step 3.2, the reward punishment step-type wind and light abandonment punishment return function is used as an action value in the improved Q learning method.
Further, in the step 4, a multivariate cosmic optimization algorithm is adopted to improve the optimal value of the state feature corresponding to the objective function in the traditional Q learning algorithm.
Further, the step 4 adopts a multivariate cosmic optimization algorithm to improve the conventional Q learning algorithm, and specifically comprises the following steps:
the multi-universe algorithm is used for optimizing the multi-level greedy action of Q learning, the occurrence of redundant action in optimization is reduced, and the Q iteration result is further reducedmvo-qError accuracy gamma ofT(ii) a And performing next state-action strategy under the condition that the iteration error precision is not satisfied, and performing next optimization processing by adopting a multi-universe algorithm, wherein an optimization formula is expressed as follows:
the invention has the advantages and beneficial effects that:
the method provided by the invention gives consideration to wind-light consumption, environmental benefits and economic benefits, establishes a mathematical model for a target function by considering conventional units, wind-light units, energy storage units, large power grid interaction processes and pollutant treatment inside a microgrid, and introduces a reward and punishment step type wind and light abandoning punishment return function to further plan wind-light power generation grid connection. Meanwhile, a Q learning algorithm improved by a multi-universe algorithm is provided, the state and the action parameters of the traditional Q learning are corresponding to the target function and the constraint condition of the micro-grid dispatching and the light abandoning and punishment of the abandoned wind, and the maximum environmental benefit and the complete wind and light consumption are realized while the stable power supply of the system is met. The improved Q learning algorithm provided by the invention adopts a planning mechanism for optimization, avoids the problem of optimal value local convergence generated in the optimization process of the traditional algorithm, considers a selection mechanism of wind and light abandoning punishment return, and solves the problem of multi-objective optimization in a microgrid scheduling model.
The method reduces the abandonment rate of renewable energy sources in the operation scheduling of the micro-grid, reduces the fluctuation of energy interaction between the micro-grid and the large grid, solves the problems of slow response and non-convergence of the traditional optimization method, and improves the stability and the economy of the operation of the micro-grid.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
As shown in fig. 4, the method for optimizing and scheduling a microgrid based on improved Q learning penalty selection of the present invention includes the following steps:
step 1: constructing an objective function according to the running cost of the conventional units inside the microgrid, the environmental benefit cost, and the power exchange cost with the main power grid;
step 1.1: under the condition of wind-solar high-proportion grid connection, the conventional unit is divided into a conventional operation state and a low-load operation state, namely the conventional power generation cost inside the microgrid is expressed as follows:
in the formula: fcfThe running cost of the conventional unit is reduced; a. b and c are cost factors in the normal running state of the conventional unit; piOutputting power for the ith conventional unit; g. h, l and p are cost factors in a low-load operation state; kPi,maxThe critical power of the normal operation state and the low-power operation state of the ith conventional unit.
Step 1.2: under the condition of uncertain wind and light output, the start-stop cost of the conventional unit is expressed as follows:
in the formula: Fon-off is the start-stop cost of the conventional unit; C is the number of unit start-stop times; K(ti,r) is the cost of the rth start of the ith unit; ti,r is the continuous shutdown time of the ith unit before the rth start; C(ti,r) is the running cost of the associated auxiliary system during unit cold start; tcold-hot is the shutdown critical time separating cold-state start and hot-state start of the unit.
Step 1.3: the pollutants discharged by the conventional unit for power generation mainly contain nitrogen oxides, sulfur oxides, carbon dioxide and the like, and the treatment cost is expressed as follows:
Em(Pi) = (αi,m + βi,mPi + γi,mPi²) + ζi,m·exp(δi,mPi)
in the formula: fgThe cost is reduced for the pollution treatment of the conventional unit; m is the type of the discharged pollutant; em(Pi) The discharge amount of pollutants of the ith unit is calculated; etamThe treatment cost coefficient of the m-th pollutants; alpha is alphai,m、βi,m、γi,m、ζi,m、δi,mThe discharge coefficient of the mth pollutant discharged by the ith unit;
step 1.4: the power exchange cost of the micro grid and the large grid is expressed as follows:
in the formula: f
gridThe cost is the power interaction cost of the micro-grid and the large grid; lambda [ alpha ]
pThe electricity selling value is 1 and the electricity purchasing value is-1 for the micro-grid electricity selling and purchasing state; p
su/shExcess and shortage of power inside the microgrid;
the price of electricity sold and purchased by a large power grid.
Step 1.5: the method is characterized in that an objective function is constructed according to the running cost, the environmental benefit cost and the power exchange cost of a main power grid of a conventional unit in a microgrid, and is expressed as follows:
minF = Fcf + Fon-off + Fg + Fgrid
in the formula: F is the objective function value of microgrid system operation; Fcf, Fon-off, Fg and Fgrid are, respectively, the conventional unit running cost, the start-stop cost, the pollution treatment cost, and the power interaction cost between the microgrid and the large power grid.
Step 2: establishing constraint conditions of micro-grid operation;
step 2.1: the power balance constraint is expressed as follows:
in the formula: the first three symbols represent, respectively, the conventional unit output power, the wind power output power and the photovoltaic output power in period t; the next is the storage and release power of the storage battery in period t; Pt,grid is the interaction power with the large power grid; Pt,L is the total load power in period t; and T is the total operating period of the microgrid, taken as 24 h.
Step 2.2: the battery storage state constraint is expressed as follows:
SOCmin≤SOC(t)≤SOCmax
in the formula: SOC(t) is the state of charge of the storage battery at time t; SOCmin and SOCmax represent the minimum and maximum states of charge of the battery, respectively.
Step 2.3: for a conventional unit, the accumulated start-stop time should be greater than the minimum continuous start-stop time, and the constraint is expressed as follows:
in the formula: the two symbols are, respectively, the minimum continuous shutdown time of the unit and the minimum continuous startup time of the unit.
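The three constraint families of step 2 can be sketched as simple feasibility checks; the limit values (SOCmin, SOCmax, minimum start/stop times, tolerance) are hypothetical placeholders, not values fixed by the invention:

```python
def power_balance_ok(p_gen, p_wind, p_pv, p_batt, p_grid, p_load, tol=1e-6):
    """Step 2.1: in each period t, conventional + wind + PV + battery
    (positive when discharging) + grid exchange must equal the load."""
    return abs(p_gen + p_wind + p_pv + p_batt + p_grid - p_load) <= tol

def soc_ok(soc, soc_min=0.2, soc_max=0.9):
    """Step 2.2: battery state-of-charge constraint
    SOCmin <= SOC(t) <= SOCmax."""
    return soc_min <= soc <= soc_max

def min_run_time_ok(run_hours, stop_hours, t_on_min=2, t_off_min=2):
    """Step 2.3: accumulated continuous startup/shutdown time must be
    at least the unit's minimum continuous startup/shutdown time."""
    return run_hours >= t_on_min and stop_hours >= t_off_min
```

In a scheduling loop, a candidate dispatch action would be rejected (or penalized) whenever any of these checks returns False.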
And step 3: constructing a penalty return function taking the highest wind abandon cost and the wind-light complete absorption cost as the highest and the lowest threshold values;
step 3.1: the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid are specified, and the growth range χn from complete wind-light consumption to the maximum quota of the abandoned wind and light quantity is divided into the following intervals:
in the formula: the first two symbols are, respectively, the maximum and minimum quotas of the abandoned wind and light quantity specified in the system; n is the number of divided intervals; and λ is the growth step length of the quota increment.
Step 3.2: according to the quota intervals specified by the system for the abandoned wind and light quantity, the abandoned wind and light quantity is subjected to linearization processing to obtain the reward-punishment stepped wind and light abandoning penalty return function, expressed as follows:
in the formula: dab is the wind and light abandoning penalty return function value; Pab,wp is the abandoned wind and light quantity of the system; c is the wind and light abandoning penalty coefficient; and k is the interval growth step of the penalty coefficient.
In step 3.2, the reward-punishment stepped wind and light abandoning penalty return function is taken as the action value in the improved Q learning method.
Step 4: improving the traditional Q learning algorithm by adopting the multi-verse optimization algorithm;
the multivariate universe optimization algorithm is used as a heuristic search algorithm, the universe is used as a feasible problem solution, and cyclic iteration is performed through the interaction of the black holes, the white holes and the wormholes, namely, the optimal selection of the traditional Q learning algorithm in an unsupervised state is subjected to iterative optimization, so that an enhanced target solution is obtained. The state-action function of the optimized improved Q learning algorithm is represented as follows:
in the formula: f
sAs the state characteristic of the traditional Q learning, the state characteristic corresponds to a target function F operated by the micro-grid system;
corresponding to the reward punishment step type wind and light abandoning punishment return function value d for the action characteristics optimized by the multi-universe optimization algorithm
ab;
Respectively the initial values of the state characteristic and the action characteristic; e
mvo-pThe expected value under the MVO-Q strategy is obtained; t is the iteration number;
Y
Trespectively, the reward value and discount coefficient under iteration.
The multi-verse algorithm is used for optimizing the multi-level greedy actions of Q learning, reducing the occurrence of redundant actions during optimization and further reducing the error precision γT of the Q iteration result Qmvo-q (the initial error precision is γT0). If the iteration error precision is not satisfied, the next state-action strategy is carried out and the multi-verse algorithm is adopted for the next optimization processing, wherein the optimization formula is expressed as follows:
in the formula: the first two symbols are the action feature and the state feature at time T-1; the next is the state feature at time T; and the last is the reward value at time T-1.
The multi-verse optimization algorithm improves the optimal value of the state feature corresponding to the objective function in the traditional Q learning algorithm.
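A minimal sketch of the tabular value update underlying the improved Q iteration described above; the multi-verse action selection is abstracted away here, and the learning rate and discount coefficient values are illustrative assumptions:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].
    In the MVO-Q variant of the patent, the next action is proposed by
    the multi-verse optimizer rather than pure epsilon-greedy search,
    but the value update itself keeps this standard form."""
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    td_target = reward + gamma * best_next
    Q.setdefault(s, {}).setdefault(a, 0.0)
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]
```

Here the state s would encode the cost-term vector Fs of step 5.1 and the action a the discretized dispatch/penalty choice of step 5.2.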
And 5: and (3) carrying out Markov decision description processing on the target function obtained in the step (1), and carrying out planning solution on the obtained state and action description by using an improved Q learning algorithm.
Step 5.1: the objective function in the step 1 comprises unit operation cost, environmental benefit cost and main power grid power exchange cost, so that the state description of each main body in the system in the iterative process T is represented as follows:
Fs=[Fcf,Fon-off,Em(Pi),Fg,Fgrid,F]
step 5.2: the constraint conditions in step 2 comprise the conventional unit output power, the wind power and photovoltaic output power, the storage and release power of the storage battery, the large power grid interaction power and the total load power; taking the reward-punishment principle for the abandoned wind and light quantity into account at the same time, these quantities are discretized to obtain the action description of each main body in the system in iteration process T, expressed as follows:
step 5.3: as shown in fig. 1, the steps of solving the optimal value of the objective function by the Q learning algorithm improved by the multi-verse algorithm are as follows:
5.31) defining the minimum and maximum quotas of the abandoned wind and light quantity in the microgrid, dividing the wind and light abandoning penalty intervals, and initializing the parameters of the multi-verse algorithm: the number of universe individuals N, the dimension n, the maximum iteration number MAX and the initial wormhole position Xij;
5.32) randomly selecting the initial state of the Q learning algorithm;
5.33) optimizing the initial action of the Q learning greedy strategy by the multi-verse algorithm;
5.34) outputting the initial state based on the greedy strategy and performing initial optimization preparation;
5.35) solving an optimal value minF of the objective function according to the optimized initial action;
5.36) judging whether the error precision is met;
5.37) if the error precision is satisfied, selecting the action, calculating the optimal value update and the wormhole distance of the multi-verse algorithm, and carrying out the next iteration, wherein the optimal value update formula is as follows:
in the formula: xjThe position of the optimal universe individual is determined; p is a radical of1/p2/p3∈[0,1]Is a random number; epsilon is the rate of cosmic expansion; u. ofj,ljThe upper and lower limits of x; eta is the proportion of wormholes in all individuals, is specified by the iteration number L and the maximum iteration number L, and is expressed as follows:
the multivariate cosmic algorithm optimizing mechanism is that black holes and swinging are selected according to a roulette mechanism, an individual moves in the current optimal cosmic through expansion and self-turning, and the optimal moving distance in the moving process is related to the iteration precision p and is expressed as follows:
5.38) if the error precision is not satisfied, abandoning the action of this iteration, selecting an action again, and returning to step 5.35);
5.39) judging whether the objective function value is the global optimum value; if not, returning to step 5.38);
5.40) if the value is the global optimum value, outputting the final state and action;
5.41) calculating the final result.
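The overall flow of steps 5.31)-5.41) can be sketched as a toy loop, assuming a generic objective over bounded decision variables; the population size, iteration limit, error precision tol and the wormhole/travelling-distance formulas follow the standard multi-verse optimizer and are illustrative, not the exact quantities of the invention:

```python
import random

def mvo_q_schedule(objective, lb, ub, n_universes=10, max_iter=50, tol=1e-6, seed=1):
    """Initialize universes (candidate dispatch actions), evaluate the
    objective, move universes toward the best one, and stop when the
    improvement between iterations falls below the error precision."""
    rng = random.Random(seed)
    dim = len(lb)
    pop = [[rng.uniform(lb[j], ub[j]) for j in range(dim)]
           for _ in range(n_universes)]
    best = list(min(pop, key=objective))
    best_val = objective(best)
    for l in range(1, max_iter + 1):
        eta = 0.2 + 0.8 * l / max_iter               # wormhole proportion
        dist = 1.0 - (l / max_iter) ** (1.0 / 6.0)   # travelling distance rate
        for x in pop:
            for j in range(dim):
                if rng.random() < eta:               # travel toward the best
                    step = dist * ((ub[j] - lb[j]) * rng.random() + lb[j])
                    x[j] = best[j] + step if rng.random() < 0.5 else best[j] - step
                    x[j] = max(lb[j], min(ub[j], x[j]))
        cand = min(pop, key=objective)
        cand_val = objective(cand)
        if cand_val < best_val:
            improvement = best_val - cand_val
            best, best_val = list(cand), cand_val
            if improvement < tol:                    # error precision reached
                break
    return best, best_val
```

In the patent's setting, `objective` would be the microgrid cost minF evaluated on a feasible dispatch, with infeasible actions rejected by the step-2 constraints and penalized by the stepped curtailment return.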
Carrying out experiment simulation by adopting the classic electric load requirement in the conventional micro-grid, wherein the experiment parameters are set as follows:
the method provided by the invention is used for carrying out optimized dispatching on a typical micro-grid comprising a wind power plant, a photovoltaic power plant, a gas turbine unit and an energy storage unit, supposing that power interaction exists between the micro-grid and a large power grid, and carrying out optimized solving on an objective function by adopting a traditional particle swarm algorithm and the improved Q learning algorithm to obtain a system comprehensive dispatching plan meeting the maximum wind and light consumption. As shown in FIGS. 2 and 3, through comparative analysis of simulation experiments, the total wind and light consumption of the micro-grid dispatching by using the method provided by the invention is improved by 33.18%, and the comprehensive cost is reduced by 6.51%. Therefore, the wind-solar energy consumption ratio can be greatly improved in the scheduling planning process of the micro-grid, and the maximization of the economic benefit is achieved while the environmental benefit is met.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.