CN111275572B - Unit scheduling system and method based on particle swarm and deep reinforcement learning - Google Patents

Unit scheduling system and method based on particle swarm and deep reinforcement learning Download PDF

Info

Publication number
CN111275572B
Authority
CN
China
Prior art keywords
target
coal
input
particle swarm
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010043546.0A
Other languages
Chinese (zh)
Other versions
CN111275572A (en)
Inventor
于长军
林志赟
韩志敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202010043546.0A
Publication of CN111275572A
Application granted
Publication of CN111275572B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a unit scheduling system and method based on particle swarm optimization and deep reinforcement learning. The system comprises a particle swarm module and a deep reinforcement learning model, the latter comprising an evaluation network, an experience playback pool, a target network and a loss function. The input of the particle swarm module is the load demand and its output is connected to the evaluation network; the evaluation network outputs a Q estimated value and feeds the experience playback pool. The output of the experience playback pool is connected to the target network, which outputs a Q target value; both the Q target value and the Q estimated value are input into the loss function, and the output of the loss function is fed back to the evaluation network. The invention optimizes unit scheduling so that, while the load demand is met, at least 0.1 g of coal is saved per kilowatt-hour of electricity, and it realizes integrated control optimization of the bottom-layer equipment and the unit scheduling.

Description

Unit scheduling system and method based on particle swarm and deep reinforcement learning
Technical Field
The invention belongs to the field of information control, and relates to a unit scheduling system and method based on particle swarm and deep reinforcement learning.
Background
Economic dispatch of power units is an important link in the operation of a power system and, owing to its multi-constraint, nonlinear and high-dimensional character, has long been a subject of academic research. Optimizing unit economic dispatch not only improves the operating efficiency of the power system, substantially increases the overall benefit of power enterprises and reduces the environmental impact, but the application of artificial intelligence also advances the automation and intelligence of the system.
Economic dispatch can be understood as follows: on the premise that the power demand is met, the generation of each unit is scheduled safely and fully so that the total generation cost is minimized. Many approaches have been studied for this economic optimization problem, such as genetic algorithms, ant colony algorithms, particle swarm algorithms, neural networks, reinforcement learning, and various hybrids of these. As power systems develop, the complexity of unit economic dispatch increases, and additional constraints, such as unit start-stop time costs and ramping costs, are added to the original problem. In all of these studies, however, unit scheduling is optimized only within the original system; integrated control optimization of the bottom-layer equipment parameters is not achieved, so the required amount of coal cannot be reduced further.
Disclosure of Invention
In order to solve the above problems, the invention provides a unit scheduling system based on particle swarm and deep reinforcement learning, which comprises a particle swarm module and a deep reinforcement learning model, wherein the deep reinforcement learning model comprises an evaluation network, an experience playback pool, a target network and a loss function.
The input of the particle swarm module is the load demand and its output is connected to the evaluation network; the evaluation network outputs a Q estimated value and feeds the experience playback pool. The output of the experience playback pool is connected to the target network, which outputs a Q target value; both the Q target value and the Q estimated value are input into the loss function, and the output of the loss function is fed back to the evaluation network.
Preferably, the particle swarm module outputs a target coal amount and bottom layer controllable equipment parameters, the target coal amount is used as an input state, and the bottom layer controllable equipment parameters are used as input actions.
Preferably, the evaluation network outputs to the experience playback pool a target coal amount, an underlying controllable device parameter, a predicted reward, and a next state target coal amount.
Preferably, the experience playback pool outputs a target coal amount for a next state to the target network.
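As an illustration of this data flow, the following minimal sketch (assuming a PyTorch implementation; the names QNet, eval_net, target_net, replay_pool and the input/output dimensions are illustrative assumptions rather than details fixed by the invention) wires together the evaluation network, the target network, the experience playback pool and the loss function.

from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Fully connected Q-network used for both the evaluation and the target network."""
    def __init__(self, in_dim=1, n_actions=5, hidden=20, layers=5):
        super().__init__()
        mods, d = [], in_dim
        for _ in range(layers):
            mods += [nn.Linear(d, hidden), nn.Tanh()]  # activation assumed; the patent gives it only as an image
            d = hidden
        mods.append(nn.Linear(d, n_actions))
        self.net = nn.Sequential(*mods)

    def forward(self, x):
        return self.net(x)

# Components named in the description: evaluation network, target network,
# experience playback pool, and a loss whose output is fed back to the
# evaluation network through gradient descent.
eval_net = QNet(in_dim=1, n_actions=5)    # state = target coal amount; 5 discretized device actions (illustrative)
target_net = QNet(in_dim=1, n_actions=5)
target_net.load_state_dict(eval_net.state_dict())
replay_pool = deque(maxlen=500)           # experience playback pool
loss_fn = nn.MSELoss()                    # loss between Q target value and Q estimated value
optimizer = torch.optim.Adam(eval_net.parameters(), lr=0.01)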
Preferably, the number of particles in the particle swarm module is 80, the inertia weight w = 1, the learning factors c1 = c2 = 2.01, the maximum particle velocity is 1 and the number of iterations is 1500; the fitness function is

F = Σ_i ( a_i·P_i² + b_i·P_i + c_i ),   subject to the load-balance constraint   Σ_i P_i = P_D,

where a_i, b_i, c_i are the energy-consumption coefficients of unit i and P_i is its output; the particle velocity and position update formulas are

v_i^(k+1) = α·[ v_i^(k) + c1·rand()·( pbest_i − x_i^(k) ) + c2·rand()·( gbest − x_i^(k) ) ],
x_i^(k+1) = x_i^(k) + v_i^(k+1),

where k represents the iteration step, α is the contraction (constriction) factor,

α = 2 / | 2 − C − sqrt(C² − 4C) |,   C = c1 + c2,

pbest_i is the best position found by particle i so far, gbest is the best position found by the whole swarm, and rand() is a random number uniformly distributed in [0, 1].
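To make these particle swarm settings concrete, the following NumPy sketch allocates unit outputs for a given load demand using the constriction-factor update; the cost coefficients, unit limits and load value are illustrative placeholders, not data from the patent.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_particles, iters = 40, 80, 1500
c1 = c2 = 2.01
C = c1 + c2
alpha = 2.0 / abs(2.0 - C - np.sqrt(C**2 - 4.0 * C))  # constriction factor (~0.87 for c1 = c2 = 2.01)
v_max = 1.0
load_demand = 10500.0                                  # MW, illustrative
a = rng.uniform(0.001, 0.01, n_units)                  # illustrative cost coefficients a_i, b_i, c_i
b = rng.uniform(1.0, 3.0, n_units)
c = rng.uniform(10.0, 50.0, n_units)
p_min, p_max = np.full(n_units, 50.0), np.full(n_units, 500.0)

def fitness(P):
    """Fuel cost plus a penalty for violating the load-balance constraint."""
    cost = np.sum(a * P**2 + b * P + c)
    return cost + 1e4 * abs(P.sum() - load_demand)

x = rng.uniform(p_min, p_max, (n_particles, n_units))
v = rng.uniform(-v_max, v_max, (n_particles, n_units))
pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
gbest = pbest[pbest_f.argmin()].copy()

for k in range(iters):
    r1, r2 = rng.random((n_particles, n_units)), rng.random((n_particles, n_units))
    v = alpha * (v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x))
    v = np.clip(v, -v_max, v_max)
    x = np.clip(x + v, p_min, p_max)
    f = np.array([fitness(p) for p in x])
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

print("optimized unit outputs (MW):", gbest.round(1), "cost:", round(fitness(gbest), 1))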
Based on the above purpose, the invention also provides a scheduling method using the unit scheduling system based on particle swarm and deep reinforcement learning, comprising the following steps:
S10, optimizing the outputs of all units with the particle swarm module according to the load demand: taking satisfaction of the load-demand instruction as the objective and subject to the output constraints of each unit, the particle swarm module allocates generation among all units; the result obtained is the scheduled output of each unit;
S20, calculating the coal amount from the unit outputs: the average coal amount, i.e. the coal burned per kilowatt-hour of electricity, is obtained from the conversion formula between coal amount and unit output; the target coal amount is this average reduced by at least 0.1 g of coal per kilowatt-hour;
S30, taking the target coal amount as the input state and the bottom-layer adjustable equipment parameters, such as those for air, water and coal, as the input actions, and inputting them into the deep reinforcement learning model;
S40, taking the target coal amount as the goal and regulating the bottom-layer controllable equipment parameters; the result obtained is the optimal parameters of all bottom-layer equipment controllers under the target coal amount;
S50, obtaining the new coal amount and cost from the target coal amount and the unit load demand; the particle swarm module is then reused to plan the unit outputs according to the new cost and load demand; S10–S40 are repeated, the optimal bottom-layer equipment parameters are obtained for the target coal amount, and the unit outputs are finally re-planned according to the new cost function (a skeleton of this loop is sketched below).
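The outer loop formed by steps S10–S50 can be summarized in the following skeleton; the injected callables pso_dispatch, coal_per_kwh, drl_tune_devices and update_cost_model are hypothetical names standing in for the operations described in the corresponding steps, not functions defined by the patent.

def schedule(load_demand, cost_model, pso_dispatch, coal_per_kwh,
             drl_tune_devices, update_cost_model, n_rounds=3):
    """Outer S10-S50 loop; the concrete operations are supplied by the caller."""
    unit_outputs = device_params = None
    for _ in range(n_rounds):
        unit_outputs = pso_dispatch(load_demand, cost_model)        # S10: allocate unit outputs
        avg_coal = coal_per_kwh(unit_outputs, load_demand)          # S20: average coal per kWh
        target_coal = avg_coal - 0.1                                # S20: save at least 0.1 g/kWh
        device_params = drl_tune_devices(target_coal)               # S30/S40: DRL tunes air/water/coal devices
        cost_model = update_cost_model(target_coal, device_params)  # S50: new coal amount -> new cost function
    return unit_outputs, device_params, cost_model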
Preferably, in S40, the target coal amount is used as the input state s and the bottom-layer controllable equipment parameters as the input action a. The input state s is fed into the evaluation network of the deep reinforcement learning model, which learns autonomously and produces the estimated reward for reaching the next state; this Q estimated value is input into the loss function. The tuple of input state s, input action a, estimated reward r and next state s' is stored in the experience playback pool. The next state s' is then input, as a state, into the target network to obtain the actually obtainable reward, i.e. the Q target value. The difference between the Q target value and the Q estimated value is fed back into the evaluation network as the feedback signal of the deep reinforcement learning model, improving its learning performance.
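A single learning step of this kind might look roughly as follows, reusing the eval_net, target_net, replay_pool, loss_fn and optimizer from the earlier architecture sketch; the batch handling and action encoding are simplifications for illustration.

import random

import torch

def learn_step(s, a, r, s_next, gamma=0.9, batch_size=32, step=0, sync_every=5):
    """One fixed-target Q-learning update (sketch; uses the components defined earlier)."""
    replay_pool.append((s, a, r, s_next))                  # store (s, a, r, s') in the playback pool
    batch = random.sample(replay_pool, min(batch_size, len(replay_pool)))
    states = torch.tensor([[b[0]] for b in batch], dtype=torch.float32)
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    next_states = torch.tensor([[b[3]] for b in batch], dtype=torch.float32)

    q_estimate = eval_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q estimated value
    with torch.no_grad():
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values  # Q target value
    loss = loss_fn(q_estimate, q_target)                   # difference fed back to the evaluation network
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:                             # periodically copy parameters to the target network
        target_net.load_state_dict(eval_net.state_dict())
    return loss.item()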
Compared with the prior art, the invention has at least the following beneficial effects: when unit scheduling is optimized from the perspective of saving coal, the load demand is met while at least 0.1 g of coal is saved per kilowatt-hour of electricity, and integrated control optimization of the bottom-layer equipment and the unit scheduling is achieved at the same time. The invention combines a particle swarm module with a deep reinforcement learning model. The particle swarm module has few parameters, is easy to implement and searches for the global optimum, and is widely applied to unit dispatch optimization. The deep reinforcement learning model combines deep learning and reinforcement learning: deep learning has strong perception capability but limited decision-making capability, whereas reinforcement learning has decision-making capability; combining the two makes their advantages complementary and provides a solution to the perception-decision problem of complex systems.
Drawings
FIG. 1 is a block diagram of the unit scheduling system based on particle swarm and deep reinforcement learning according to an embodiment of the invention;
FIG. 2 is a flow chart of the steps of the unit scheduling method based on particle swarm and deep reinforcement learning according to an embodiment of the invention;
FIG. 3 shows the unit output distribution before and after optimization by the unit scheduling based on particle swarm and deep reinforcement learning according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the power plant cost during the optimization of the unit outputs by the unit scheduling based on particle swarm and deep reinforcement learning according to an embodiment of the invention;
FIG. 5 is a schematic diagram of the evolution of the loss function of the particle swarm and deep reinforcement learning system according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
On the contrary, the invention is intended to cover any alternatives, modifications, equivalents, and variations that may be included within the spirit and scope of the invention as defined by the appended claims. Further, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention; however, the invention can be fully understood by those skilled in the art even without some of these details.
System example 1
Referring to FIG. 1, a unit scheduling system based on particle swarm and deep reinforcement learning according to an embodiment of the invention comprises a particle swarm module 10 and a deep reinforcement learning model 20, wherein the deep reinforcement learning model 20 comprises an evaluation network 21, an experience playback pool 22, a target network 23 and a loss function 24.
The input of the particle swarm module 10 is the load demand and its output is connected to the evaluation network 21; the evaluation network 21 outputs the Q estimated value and feeds the experience playback pool 22. The output of the experience playback pool 22 is connected to the target network 23, which outputs the Q target value; both the Q target value and the Q estimated value are input into the loss function 24, and the output of the loss function 24 is fed back to the evaluation network 21.
System example 2
The particle swarm module 10 outputs a target coal amount and bottom layer controllable equipment parameters, wherein the target coal amount is used as an input state, and the bottom layer controllable equipment parameters are used as input actions.
The evaluation network 21 outputs to the experience playback pool 22 a target coal amount, bottom layer controllable device parameters, estimated rewards, and next state target coal amount.
The experience playback pool 22 outputs a target coal amount for the next state to the target network 23.
The particle swarm module 10 uses 80 particles, inertia weight w = 1, learning factors c1 = c2 = 2.01, a maximum particle velocity of 1 and 1500 iterations. The fitness function is

F = Σ_i ( a_i·P_i² + b_i·P_i + c_i ),   subject to the load-balance constraint   Σ_i P_i = P_D,

where a_i, b_i, c_i are the energy-consumption coefficients of unit i and P_i is its output. The particle velocity and position update formulas are

v_i^(k+1) = α·[ v_i^(k) + c1·rand()·( pbest_i − x_i^(k) ) + c2·rand()·( gbest − x_i^(k) ) ],
x_i^(k+1) = x_i^(k) + v_i^(k+1),

where k represents the iteration step, α is the contraction (constriction) factor,

α = 2 / | 2 − C − sqrt(C² − 4C) |,   C = c1 + c2,

pbest_i is the best position found by particle i so far, gbest is the best position found by the whole swarm, and rand() is a random number uniformly distributed in [0, 1].
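Assuming the Clerc-Kennedy form of the constriction factor given above, the value of α implied by c1 = c2 = 2.01 can be checked directly:

# Constriction factor for c1 = c2 = 2.01 (assumed Clerc-Kennedy form).
C = 2.01 + 2.01
alpha = 2 / abs(2 - C - (C**2 - 4 * C) ** 0.5)
print(round(alpha, 4))  # prints 0.8682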
Parameter settings in the deep reinforcement learning model 20: deep reinforcement learning is implemented with a fixed Q-network and the experience playback pool 22. The evaluation network 21 and the target network 23 of the fixed Q-network each have 5 hidden layers of 20 neurons, the target network parameters are replaced every 5 steps, and the activation function is given by the formula shown in the original image (not reproduced here).
The learning rate is 0.01, epsilon-greedy is set to 0.9, the reward decay value gamma is 0.9, and the memory size is 500. The reward is +1 if the output is higher than the load demand, −1 if it is lower, and 0 if the load demand is met; the number of iterations is 300, and the deep reinforcement learning reward function is designed on the basis of the cost function. The input layer is the observation (Observation); the actions (Action) are the opening a of the secondary air damper, the speed b of the coal mill, the rotating speed c of the belt, the opening d of the water supply valve and the power e of the water supply pump. The observations are listed in Table 1, where C is the average coal amount and C−0.1, C−0.2, C−0.3, C−0.4 and C−0.5 are the target average coal amounts, i.e. savings of at least 0.1 g, 0.2 g, 0.3 g, 0.4 g and 0.5 g of coal per kilowatt-hour, respectively.
TABLE 1 Observation list
(The table is given only as an image in the original; it lists the observation states C, C−0.1, C−0.2, C−0.3, C−0.4 and C−0.5 together with the corresponding actions a–e.)
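A sketch of the agent settings and the reward rule described above is given below; the environment interface, the action encoding and the interpretation of epsilon-greedy (greedy with probability 0.9) are assumptions for illustration only.

import random

HIDDEN_LAYERS = 5      # hidden layers in the evaluation and target networks
NEURONS = 20           # neurons per hidden layer
SYNC_EVERY = 5         # copy evaluation-network parameters to the target network every 5 steps
LEARNING_RATE = 0.01
EPSILON = 0.9          # epsilon-greedy exploration setting
GAMMA = 0.9            # reward decay value
MEMORY_SIZE = 500      # experience playback pool capacity
EPISODES = 300

ACTIONS = ["secondary_air_damper_a", "coal_mill_speed_b", "belt_speed_c",
           "feed_water_valve_d", "feed_water_pump_e"]

def reward(generated_power, load_demand):
    """Reward rule from the description: +1 above demand, -1 below, 0 when the demand is met."""
    if generated_power > load_demand:
        return 1
    if generated_power < load_demand:
        return -1
    return 0

def choose_action(q_values):
    """Epsilon-greedy selection, interpreted as acting greedily with probability 0.9."""
    if random.random() < EPSILON:
        return max(range(len(q_values)), key=lambda i: q_values[i])
    return random.randrange(len(q_values))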
Method embodiment
Referring to FIG. 2, the method of the unit scheduling system based on particle swarm and deep reinforcement learning comprises the following steps:
S10, optimizing the outputs of all units with the particle swarm module according to the load demand: taking satisfaction of the load-demand instruction as the objective and subject to the output constraints of each unit, the particle swarm module allocates generation among all units; the result obtained is the scheduled output of each unit;
S20, calculating the coal amount from the unit outputs: the average coal amount, i.e. the coal burned per kilowatt-hour of electricity, is obtained from the conversion formula between coal amount and unit output; the target coal amount is this average reduced by at least 0.1 g of coal per kilowatt-hour;
S30, taking the target coal amount as the input state and the bottom-layer adjustable equipment parameters, such as those for air, water and coal, as the input actions, and inputting them into the deep reinforcement learning model;
S40, taking the target coal amount as the goal and regulating the bottom-layer controllable equipment parameters; the result obtained is the optimal parameters of all bottom-layer equipment controllers under the target coal amount;
S50, obtaining the new coal amount and cost from the target coal amount and the unit load demand; the particle swarm module is then reused to plan the unit outputs according to the new cost and load demand; S10–S40 are repeated, the optimal bottom-layer equipment parameters are obtained for the target coal amount, and the unit outputs are finally re-planned according to the new cost function.
In this specific embodiment, in S40, the target coal amount is used as the input state s and the bottom-layer controllable equipment parameters as the input action a. The input state s is fed into the evaluation network of the deep reinforcement learning model, which learns autonomously and produces the estimated reward for reaching the next state; this Q estimated value is input into the loss function. The tuple of input state s, input action a, estimated reward r and next state s' is stored in the experience playback pool. The next state s' is then input, as a state, into the target network to obtain the actually obtainable reward, i.e. the Q target value. The difference between the Q target value and the Q estimated value is fed back into the evaluation network as the feedback signal of the deep reinforcement learning model, improving its learning performance.
In this specific embodiment, the particle swarm module optimizes the output of each unit according to the load demand; the resulting simulation effects are shown in FIG. 3 and FIG. 4.
FIG. 3 is a graph showing the unit output distribution before and after optimization by the constriction-factor particle swarm module 10. The abscissa is the unit index, with 40 units in total, and the ordinate is the output of each unit. The black bars show the initial unit outputs before optimization and the white bars show the optimized outputs.
FIG. 4 is a graph of the power plant cost during the optimization of the unit outputs. The abscissa is the number of iterations of the method and the ordinate is the power plant cost. As the method iterates, the curve shows a steadily decreasing trend, and the cost of the power plant keeps falling.
The coal amount is then calculated from the optimized output of each unit according to the following conversion formula:

B = N · f_b · 29271 / ( 1000 · Q_net,ar )   (t/h)

where B is the coal burned by the boiler (t/h), N is the unit output power (MW), Q_net,ar is the net (lower) calorific value of the coal as received (kJ/kg), 29271 kJ/kg is the net calorific value of standard coal, and f_b is the standard coal consumption rate for power generation (g/kWh).
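Under the assumption that the relation above holds, a small helper with illustrative numbers computes the boiler coal rate:

# Coal amount from unit output (sketch of the assumed conversion formula; numbers are illustrative).
STANDARD_COAL_LHV = 29271.0  # kJ/kg, net calorific value of standard coal

def boiler_coal_rate(n_mw, f_b_g_per_kwh, q_net_ar_kj_per_kg):
    """Actual coal burned B in t/h for a unit producing n_mw megawatts.

    n_mw * 1000 kWh/h * f_b g/kWh gives standard coal in g/h; divide by 1e6 for t/h,
    then scale by the ratio of the standard-coal heating value to the actual coal heating value.
    """
    standard_coal_t_per_h = n_mw * 1000.0 * f_b_g_per_kwh / 1e6
    return standard_coal_t_per_h * STANDARD_COAL_LHV / q_net_ar_kj_per_kg

# Example: a 300 MW unit, 310 g/kWh standard coal rate, coal heating value 21000 kJ/kg
print(round(boiler_coal_rate(300, 310, 21000), 1), "t/h")  # ~129.6 t/h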
Dividing the coal amount by the load demand gives the coal consumption C per kilowatt-hour of electricity. Five target coal-consumption states (states) are defined, namely C−0.1, C−0.2, C−0.3, C−0.4 and C−0.5, and the observations (observation) are input into the deep reinforcement learning model; the resulting effect is shown in FIG. 5.
FIG. 5 shows the evolution of the loss function of the deep reinforcement learning model 20, where the abscissa is the number of learning steps of the deep reinforcement learning and the ordinate is the change of the prediction error. Because training is a continual exploration process and the input data are generated during learning, the curve is not smooth.
The particle swarm module is then reused to plan the unit outputs according to the new coal cost and the original load demand, completing the optimization of the whole unit schedule.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (2)

1. A unit scheduling method based on particle swarm and deep reinforcement learning, characterized in that the unit scheduling system based on particle swarm and deep reinforcement learning comprises a particle swarm module and a deep reinforcement learning model, wherein the deep reinforcement learning model comprises an evaluation network, an experience playback pool, a target network and a loss function;
the input of the particle swarm module is the load demand and its output is connected to the evaluation network; the evaluation network outputs a Q estimated value and feeds the experience playback pool; the output of the experience playback pool is connected to the target network, which outputs a Q target value; both the Q target value and the Q estimated value are input into the loss function, and the output of the loss function is fed back to the evaluation network;
the particle swarm module outputs target coal quantity and bottom layer controllable equipment parameters, wherein the target coal quantity is used as an input state, and the bottom layer controllable equipment parameters are used as input actions;
the evaluation network outputs target coal burning quantity, bottom controllable equipment parameters, estimated rewards and target coal burning quantity in the next state to the experience playback pool;
the experience playback pool outputs a target coal amount in the next state to the target network;
the number of particles in the particle swarm module is 80, the inertia weight w = 1, the learning factors c1 = c2 = 2.01, the maximum particle velocity is 1 and the number of iterations is 1500; the fitness function is

F = Σ_i ( a_i·P_i² + b_i·P_i + c_i ),   subject to the load-balance constraint   Σ_i P_i = P_D,

where a_i, b_i, c_i are the energy-consumption coefficients of unit i and P_i is its output; the particle velocity and position update formulas are

v_i^(k+1) = α·[ v_i^(k) + c1·rand()·( pbest_i − x_i^(k) ) + c2·rand()·( gbest − x_i^(k) ) ],
x_i^(k+1) = x_i^(k) + v_i^(k+1),

where k represents the iteration step, α is the contraction (constriction) factor,

α = 2 / | 2 − C − sqrt(C² − 4C) |,   C = c1 + c2,

pbest_i is the best position found by particle i so far, gbest is the best position found by the whole swarm, and rand() is a random number uniformly distributed in [0, 1];
The method comprises the following steps:
S10, optimizing the outputs of all units with the particle swarm module according to the load demand: taking satisfaction of the load-demand instruction as the objective and subject to the output constraints of each unit, the particle swarm module allocates generation among all units; the result obtained is the scheduled output of each unit;
S20, calculating the coal amount from the unit outputs: the average coal amount, i.e. the coal burned per kilowatt-hour of electricity, is obtained from the conversion formula between coal amount and unit output; the target coal amount is this average reduced by at least 0.1 g of coal per kilowatt-hour;
S30, taking the target coal amount as the input state and the bottom-layer adjustable equipment parameters, such as those for air, water and coal, as the input actions, and inputting them into the deep reinforcement learning model;
S40, taking the target coal amount as the goal and regulating the bottom-layer controllable equipment parameters; the result obtained is the optimal parameters of all bottom-layer equipment controllers under the target coal amount;
S50, obtaining the new coal amount and cost from the target coal amount and the unit load demand; the particle swarm module is then reused to plan the unit outputs according to the new cost and load demand; S10–S40 are repeated, the optimal bottom-layer equipment parameters are obtained for the target coal amount, and the unit outputs are finally re-planned according to the new cost function.
2. The method according to claim 1, wherein in S40 the target coal amount is used as the input state s and the bottom-layer controllable equipment parameters as the input action a; the input state s is fed into the evaluation network of the deep reinforcement learning model, which learns autonomously and produces the estimated reward for reaching the next state, and this Q estimated value is input into the loss function; the tuple of input state s, input action a, estimated reward r and next state s' is stored in the experience playback pool; the next state s' is then input, as a state, into the target network to obtain the actually obtainable reward, i.e. the Q target value; and the difference between the Q target value and the Q estimated value is fed back into the evaluation network as the feedback of the deep reinforcement learning model, improving its learning performance.
CN202010043546.0A 2020-01-15 2020-01-15 Unit scheduling system and method based on particle swarm and deep reinforcement learning Active CN111275572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010043546.0A CN111275572B (en) 2020-01-15 2020-01-15 Unit scheduling system and method based on particle swarm and deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010043546.0A CN111275572B (en) 2020-01-15 2020-01-15 Unit scheduling system and method based on particle swarm and deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111275572A CN111275572A (en) 2020-06-12
CN111275572B (en) 2023-07-11

Family

ID=71001640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010043546.0A Active CN111275572B (en) 2020-01-15 2020-01-15 Unit scheduling system and method based on particle swarm and deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111275572B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016811A (en) * 2020-08-04 2020-12-01 四叶草(苏州)智能科技有限公司 AGV intelligent scheduling system and method based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012143424A1 (en) * 2011-04-19 2012-10-26 Ge Energy Products France Snc System and method for controlling an electrical energy production installation
CN108108532A (en) * 2017-12-06 2018-06-01 华南理工大学 With the method for particle cluster algorithm optimization power electronic circuit
CN108539784A (en) * 2018-04-13 2018-09-14 华南理工大学 The optimization method of the optimal unit of micro-capacitance sensor and tou power price based on Demand Side Response

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9373960B2 (en) * 2013-03-13 2016-06-21 Oracle International Corporation Computerized system and method for distributed energy resource scheduling
CN104682405B (en) * 2015-03-31 2018-01-12 福州大学 A kind of var Optimization Method in Network Distribution based on taboo particle cluster algorithm
CN108390369A (en) * 2017-12-31 2018-08-10 天津求实智源科技有限公司 Electric load general power real-time decomposition method and system
US11416739B2 (en) * 2018-01-29 2022-08-16 Lawrence Livermore National Security, Llc Optimization control technology for building energy conservation
CN109347149B (en) * 2018-09-20 2022-04-22 国网河南省电力公司电力科学研究院 Micro-grid energy storage scheduling method and device based on deep Q-value network reinforcement learning
CN110414725B (en) * 2019-07-11 2021-02-19 山东大学 Wind power plant energy storage system scheduling method and device integrating prediction and decision
CN110518580B (en) * 2019-08-15 2023-04-28 上海电力大学 Active power distribution network operation optimization method considering micro-grid active optimization
CN110535146B (en) * 2019-08-27 2022-09-23 哈尔滨工业大学 Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning
CN110515303B (en) * 2019-09-17 2022-09-09 余姚市浙江大学机器人研究中心 DDQN-based self-adaptive dynamic path planning method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012143424A1 (en) * 2011-04-19 2012-10-26 Ge Energy Products France Snc System and method for controlling an electrical energy production installation
CN108108532A (en) * 2017-12-06 2018-06-01 华南理工大学 With the method for particle cluster algorithm optimization power electronic circuit
CN108539784A (en) * 2018-04-13 2018-09-14 华南理工大学 The optimization method of the optimal unit of micro-capacitance sensor and tou power price based on Demand Side Response

Also Published As

Publication number Publication date
CN111275572A (en) 2020-06-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant